Posted to dev@lucene.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2011/12/26 21:12:30 UTC
[jira] [Created] (LUCENE-3668) offsets issues with multiword synonyms
offsets issues with multiword synonyms
--------------------------------------
Key: LUCENE-3668
URL: https://issues.apache.org/jira/browse/LUCENE-3668
Project: Lucene - Java
Issue Type: Bug
Components: modules/analysis
Reporter: Robert Muir
As reported on the list, there are some strange offsets with FSTSynonyms in the case of multiword synonyms.
As a workaround it was suggested to use the older synonym impl, but it has bugs too (just in a different way).
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Rahul Babulal (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258775#comment-13258775 ]
Rahul Babulal commented on LUCENE-3668:
---------------------------------------
I'm using Solr 3.6, and with luceneMatchVersion=3.6 in my solrconfig.xml I'm still seeing issues with highlighting. However, using luceneMatchVersion=3.3 fixes my issue.
Issue Details:
In my synonyms I have:
nhl, national hockey league
If I index "Australian nhl team great", then:
search-use-case 1: searching for "hockey" (without quotes), the highlighted response snippet I get is "Australian nhl <em>team</em> great".
search-use-case 2: searching for "league" (without quotes), the highlighted response snippet I get is "Australian nhl team <em>great</em>".
Here are my fieldType and field definitions:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<field name="description" type="text_synonym" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" omitNorms="false"/>
> offsets issues with multiword synonyms
> --------------------------------------
>
> Key: LUCENE-3668
> URL: https://issues.apache.org/jira/browse/LUCENE-3668
> Project: Lucene - Java
> Issue Type: Bug
> Components: modules/analysis
> Reporter: Robert Muir
> Assignee: Michael McCandless
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3668.patch, LUCENE-3668_test.patch
>
>
> as reported on the list, there are some strange offsets with FSTSynonyms, in the case of multiword synonyms.
> as a workaround it was suggested to use the older synonym impl, but it has bugs too (just in a different way).
[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Koji Sekiguchi (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176053#comment-13176053 ]
Koji Sekiguchi commented on LUCENE-3668:
----------------------------------------
Thank you for opening this issue, Robert!
bq. Using the old impl imo is no workaround, the offsets are crazy (each individual word gets 0-22).
Good point. Using old impl, if I search for national, the whole phrase of "national hockey league" is highlighted.
bq. But i think we should just leave it be and try to improve the new one.
+1
[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Okke Klein (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506378#comment-13506378 ]
Okke Klein commented on LUCENE-3668:
------------------------------------
Doesn't work for me either in Solr4. Can we revisit this issue?
> offsets issues with multiword synonyms
> --------------------------------------
>
> Key: LUCENE-3668
> URL: https://issues.apache.org/jira/browse/LUCENE-3668
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Reporter: Robert Muir
> Assignee: Michael McCandless
> Fix For: 3.6, 4.0-ALPHA
>
> Attachments: LUCENE-3668.patch, LUCENE-3668_test.patch
>
>
> as reported on the list, there are some strange offsets with FSTSynonyms, in the case of multiword synonyms.
> as a workaround it was suggested to use the older synonym impl, but it has bugs too (just in a different way).
[jira] [Updated] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-3668:
--------------------------------
Attachment: LUCENE-3668_test.patch
Here are 2 tests, one for the old impl and one for the new one.
In this case we have national hockey league = nhl.
desired behavior:
||token||posIncr||startOffset||endOffset||
|national|1|0|8|
|nhl|0|0|*22*|
|hockey|1|9|15|
|league|1|16|22|
current behavior (FST impl):
||token||posIncr||startOffset||endOffset||
|national|1|0|8|
|nhl|0|0|*8*|
|hockey|1|9|15|
|league|1|16|22|
old impl:
||token||posIncr||startOffset||endOffset||
|national|1|0|22|
|nhl|0|0|22|
|hockey|1|0|22|
|league|1|0|22|
From the offsets perspective, nhl is only getting the offsets of national (endOffset=8), but it would be better if it got endOffset=22.
Using the old impl is, IMO, no workaround: the offsets are crazy (each individual word gets 0-22). But I think we should just leave it be and try to improve the new one.
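To make the three tables above concrete, here is a small sketch in plain Python (not Lucene code) that slices the source text with each token's start/end offsets, which is roughly what a highlighter does with them. The `highlighted` helper and the tuple layout are assumptions made for illustration; only the offset values are taken from the tables.

```python
# Illustrative only: plain Python, not the Lucene API.
# Each token is (term, posIncr, startOffset, endOffset), copied from the tables above.
TEXT = "national hockey league"

DESIRED = [("national", 1, 0, 8), ("nhl", 0, 0, 22),
           ("hockey", 1, 9, 15), ("league", 1, 16, 22)]
FST_IMPL = [("national", 1, 0, 8), ("nhl", 0, 0, 8),
            ("hockey", 1, 9, 15), ("league", 1, 16, 22)]
OLD_IMPL = [("national", 1, 0, 22), ("nhl", 0, 0, 22),
            ("hockey", 1, 0, 22), ("league", 1, 0, 22)]


def highlighted(tokens, term, text=TEXT):
    """Return the slice of `text` covered by the first token equal to `term`.

    Hypothetical helper: returns None when the term is not in the token list.
    """
    for tok, _pos_incr, start, end in tokens:
        if tok == term:
            return text[start:end]
    return None


if __name__ == "__main__":
    for name, tokens in (("desired", DESIRED),
                         ("FST impl", FST_IMPL),
                         ("old impl", OLD_IMPL)):
        # Show which part of the text each token would highlight.
        print(name, {t[0]: highlighted(tokens, t[0]) for t in tokens})
```

With the desired offsets a hit on nhl covers the whole phrase, with the FST impl offsets it covers only "national", and with the old impl offsets every token covers the whole phrase.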
[jira] [Resolved] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless resolved LUCENE-3668.
----------------------------------------
Resolution: Fixed
Fix Version/s: 4.0, 3.6
Thanks Koji!
[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182018#comment-13182018 ]
Robert Muir commented on LUCENE-3668:
-------------------------------------
+1, I think this is the right solution.
If you have "a b" -> "c d e" or "a b c" -> "d e", then what we are doing now seems good and well-defined
(I have no idea what else we could even do), but with one output we "know", and the patch improves that case.
[jira] [Updated] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-3668:
---------------------------------------
Attachment: LUCENE-3668.patch
Patch; I think it's ready.
I fixed the syn filter to set the start/end offset only when the right-hand-side of the rule has a single token; else, it does what it did before (inherit start/end offset from the single input token the output is "on top of").
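The rule described above can be sketched as follows. This is a simplified model in plain Python, not the patch itself: the `synonym_offsets` function, its parameters, and the `Token` tuple are hypothetical names. In particular, letting extra output tokens land on the input tokens that follow the match (clamped to the end of the stream) is an assumption of this sketch about what "on top of" means when the output is longer than the matched input.

```python
# Illustrative only: a simplified model of the offset rule, not Lucene code.
from typing import List, Tuple

Token = Tuple[str, int, int]  # (term, startOffset, endOffset)


def synonym_offsets(stream: List[Token], start: int, length: int,
                    output: List[str]) -> List[Token]:
    """Assign offsets to the output tokens of one matched synonym rule.

    stream: all input tokens of the field, in order.
    start:  index in `stream` of the first token matched by the rule's left side.
    length: number of input tokens matched by the rule's left side.
    output: the terms on the right-hand side of the rule.
    """
    matched = stream[start:start + length]
    if not matched or not output:
        raise ValueError("need at least one matched token and one output term")
    if len(output) == 1:
        # Single-token output: span the whole matched input.
        return [(output[0], matched[0][1], matched[-1][2])]
    # Multi-token output: each output token inherits the offsets of the single
    # input token it is stacked on (assumption: stream[start + i], clamped).
    result = []
    for i, term in enumerate(output):
        _, tok_start, tok_end = stream[min(start + i, len(stream) - 1)]
        result.append((term, tok_start, tok_end))
    return result


if __name__ == "__main__":
    phrase = [("national", 0, 8), ("hockey", 9, 15), ("league", 16, 22)]
    print(synonym_offsets(phrase, 0, 3, ["nhl"]))

    # Token offsets for the text "Australian nhl team great".
    doc = [("Australian", 0, 10), ("nhl", 11, 14),
           ("team", 15, 19), ("great", 20, 25)]
    print(synonym_offsets(doc, 1, 1, ["national", "hockey", "league"]))
```

For "national hockey league" -> "nhl" the single output token gets offsets 0-22. For "nhl" -> "national hockey league" in the text "Australian nhl team great", the model gives "hockey" the offsets of "team" and "league" the offsets of "great", which is at least consistent with the highlighting reported for Solr 3.6 elsewhere in this thread.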
[jira] [Commented] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506554#comment-13506554 ]
Robert Muir commented on LUCENE-3668:
-------------------------------------
That writeup is a little off.
{quote}
Finally, and most seriously, the SynonymFilterFactory will simply not match multi-word synonyms in user queries if you do any kind of tokenization. This is because the tokenizer breaks up the input before the SynonymFilterFactory can transform it.
{quote}
That's not correct. The bug is in QueryParser: LUCENE-2605.
[jira] [Assigned] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Michael McCandless (Assigned) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless reassigned LUCENE-3668:
------------------------------------------
Assignee: Michael McCandless
[jira] [Comment Edited] (LUCENE-3668) offsets issues with multiword synonyms
Posted by "Okke Klein (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506378#comment-13506378 ]
Okke Klein edited comment on LUCENE-3668 at 11/29/12 12:36 PM:
---------------------------------------------------------------
Doesn't work for me either in Solr4. Can we revisit this issue?
Perhaps this http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ can give some insight/help?
was (Author: okkeklein):
Doesn't work for me either in Solr4. Can we revisit this issue?