You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/12/18 06:54:18 UTC

[jira] Created: (SOLR-1670) synonymfilter/map repeat bug

synonymfilter/map repeat bug
----------------------------

                 Key: SOLR-1670
                 URL: https://issues.apache.org/jira/browse/SOLR-1670
             Project: Solr
          Issue Type: Bug
          Components: Schema and Analysis
    Affects Versions: 1.4
            Reporter: Robert Muir
         Attachments: SOLR-1670_test.patch

as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter

the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
{code}
    // repeats
    map.add(strings("a b"), tokens("ab"), orig, merge);
    map.add(strings("a b"), tokens("ab"), orig, merge);
    assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
    /* in reality the result from getTokList is ab ab ab!!!!! */
{code}

when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792940#action_12792940 ] 

Yonik Seeley commented on SOLR-1670:
------------------------------------

This is my mess that predates solr even going open source... I may have a few neurons that can still recognize what it was trying to do.

The repeats test was testing that repeated rules didn't mess anything up.  In the code you quote, the rule a  b -> ab is added twice.  The rule merging code should handle it (the merging code handles merging all common prefixes).

Robert, can you tell if the flaw is in the test code or the SynonymFilter?  I'm pretty sure that such a bug (in the filter) wasn't originally there... so my money would be on the test - perhaps getTokList, but I haven't had a chance to look yet.

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793655#action_12793655 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

bq. I don't think assertTokEqual really has a bug - it's written more to match lucene queries and indexes, not to exactly compare one token stream with another. So a singe ab token matches multiple ab tokens at the same position.

Seriously, I think if you want to test these things, then the assertQ etc should be used (actually run queries and test results) instead.

But this is a bug, because aa is not the same as aa,aa(pos=0),aa(pos=0), not even for "queries and indexes".
This is because the latter will affect the score of the search.

I think this is right in line with what you are saying in SOLR-1674, that you have somehow lost some flexibility: This is not true
* these things are different, the tokenstream output is different, things such as score change
* if you don't like this, and instead want to test queries, then do just that, instead of examining tokenstreams.


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-1670:
-------------------------------

    Attachment: SOLR-1670.patch

Here's a patch that removes duplicate tokens at the same position.
This isn't really a serious bug, as the repeated tokens were at the same position.

I don't think assertTokEqual really has a bug - it's written more to match lucene queries and indexes, not to exactly compare one token stream with another.   So a singe ab token matches multiple ab tokens at the same position.

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792980#action_12792980 ] 

Robert Muir edited comment on SOLR-1670 at 12/20/09 10:33 AM:
--------------------------------------------------------------

bq. Robert, can you tell if the flaw is in the test code or the SynonymFilter? I'm pretty sure that such a bug (in the filter) wasn't originally there... so my money would be on the test - perhaps getTokList, but I haven't had a chance to look yet.

I am pretty sure there is a problem (both the existing test, and the synonymMap merging code).
This is why i think there is some confusion.
The existing test implies that the output will be "ab", which makes me think the SynonymMap should merge these duplicate definitions.
The real output is "ab ab ab", but the test as written (expect "ab") passes with the existing test setup, the problem is assertTokEqual().

the rewritten test in SOLR-1674 looks like: 

{code}
assertTokenizesTo(map, "a b", new String[] { "ab", "ab", "ab"  });
{code}

where assertTokenizesTo is just

{code}
static void assertTokenizesTo(SynonymMap dict, String input,
      String expected[]) throws IOException {
    Tokenizer tokenizer = new WhitespaceTokenizer(new StringReader(input));
    SynonymFilter stream = new SynonymFilter(tokenizer, dict);
    assertTokenStreamContents(stream, expected);
  }
{code}

and assertTokenStreamContents is the one copied from lucene.

Also as mentioned before, all the position increment tests have a similar problem, where both the code and tests are buggy in such a way that we think it is working now.

None of the original positionIncrements are being preserved as the tests imply.



      was (Author: rcmuir):
    bq. Robert, can you tell if the flaw is in the test code or the SynonymFilter? I'm pretty sure that such a bug (in the filter) wasn't originally there... so my money would be on the test - perhaps getTokList, but I haven't had a chance to look yet.

I am pretty sure there is a problem. the rewritten test in SOLR-1674 looks like: 

{code}
assertTokenizesTo(map, "a b", new String[] { "ab", "ab", "ab"  });
{code}

where assertTokenizesTo is just

{code}
static void assertTokenizesTo(SynonymMap dict, String input,
      String expected[]) throws IOException {
    Tokenizer tokenizer = new WhitespaceTokenizer(new StringReader(input));
    SynonymFilter stream = new SynonymFilter(tokenizer, dict);
    assertTokenStreamContents(stream, expected);
  }
{code}

and assertTokenStreamContents is the one copied from lucene.

  
> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792363#action_12792363 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

btw i don't know how to fix this repeat problem (which is why i only uploaded a test, sorry). I am going to get some sleep and look at it tomorrow, its the last tokenstream to go thru I promise. 

if anyone knows this code better please help

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828204#action_12828204 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

bq. I left in place the existing test method, which requires the specified order.

is it possible to only expose the 'unsorted' one to synonyms test (such as in the synonyms test file itself, rather than base token stream test case?)

i can't think of another situation where it would make sense, more likely to be abused instead

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806833#action_12806833 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

Steven, i don't have a problem with your patch (I do not wish to be in the way of anyone trying to work on SynonymFilter)

But i want to explain some of where i was coming from.

The main reason i got myself into this mess was to try to add wordnet support to solr. However, this is currently not possible without duplicating a lot of code.
We need to be really careful about allowing any order, it does matter in some situations.
For example, in Lucene's synonymfilter (with wordnet support), it has an option to limit the number of expansions (so its like a top-N synonym expansion).
Solr doesnt currently have this, so its N/A for now, but just an example where the order suddenly becomes important.

only slightly related: we added some improvements to this assertion in lucene recently and found a lot of bugs, better checking for clearAttribute() and end()
at some I would like to port these test improvements over to solr, too. 


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792920#action_12792920 ] 

Koji Sekiguchi commented on SOLR-1670:
--------------------------------------

bq. the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.

I agree with you regarding this part. But I'm not sure that the following size() should be 1 in your patch:

{code}
+    assertEquals(1, getTokList(map,"a b",false).size());
{code}

If what "repeats" implies is repeating same term intentionally, I think it can boost tf.

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792980#action_12792980 ] 

Robert Muir edited comment on SOLR-1670 at 12/20/09 11:39 AM:
--------------------------------------------------------------

bq. Robert, can you tell if the flaw is in the test code or the SynonymFilter? I'm pretty sure that such a bug (in the filter) wasn't originally there... so my money would be on the test - perhaps getTokList, but I haven't had a chance to look yet.

I am pretty sure there is a problem (both the existing test, and the synonymMap merging code).
This is why i think there is some confusion.
The existing test implies that the output will be "ab", which makes me think the SynonymMap should merge these duplicate definitions.
The real output is "ab ab ab", but the test as written (expect "ab") passes with the existing test setup, the problem is assertTokEqual().

the rewritten test in SOLR-1674 looks like: 

{code}
assertTokenizesTo(map, "a b", new String[] { "ab", "ab", "ab"  });
{code}

where assertTokenizesTo is just

{code}
static void assertTokenizesTo(SynonymMap dict, String input,
      String expected[]) throws IOException {
    Tokenizer tokenizer = new WhitespaceTokenizer(new StringReader(input));
    SynonymFilter stream = new SynonymFilter(tokenizer, dict);
    assertTokenStreamContents(stream, expected);
  }
{code}

and assertTokenStreamContents is the one copied from lucene.

edit: i think i resolved the additional problems


      was (Author: rcmuir):
    bq. Robert, can you tell if the flaw is in the test code or the SynonymFilter? I'm pretty sure that such a bug (in the filter) wasn't originally there... so my money would be on the test - perhaps getTokList, but I haven't had a chance to look yet.

I am pretty sure there is a problem (both the existing test, and the synonymMap merging code).
This is why i think there is some confusion.
The existing test implies that the output will be "ab", which makes me think the SynonymMap should merge these duplicate definitions.
The real output is "ab ab ab", but the test as written (expect "ab") passes with the existing test setup, the problem is assertTokEqual().

the rewritten test in SOLR-1674 looks like: 

{code}
assertTokenizesTo(map, "a b", new String[] { "ab", "ab", "ab"  });
{code}

where assertTokenizesTo is just

{code}
static void assertTokenizesTo(SynonymMap dict, String input,
      String expected[]) throws IOException {
    Tokenizer tokenizer = new WhitespaceTokenizer(new StringReader(input));
    SynonymFilter stream = new SynonymFilter(tokenizer, dict);
    assertTokenStreamContents(stream, expected);
  }
{code}

and assertTokenStreamContents is the one copied from lucene.

Also as mentioned before, all the position increment tests have a similar problem, where both the code and tests are buggy in such a way that we think it is working now.

None of the original positionIncrements are being preserved as the tests imply.


  
> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792923#action_12792923 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

Koji, in my opinion the first step towards fixing this would be to resolve SOLR-1674.

This fixes the tests so that they reveal the true current behavior.
Because of the many problems with the current tests, I do not understand what the desired behavior of this filter should even be.
The only thing I know for sure is what the current behavior is.

here is another example:
{code}
    // test that generated tokens start at the same offset as the original
    map.add(strings("a"), tokens("aa"), orig, merge);
    assertTokEqual(getTokList(map,"a,5",false), tokens("aa,5")); /* NOTE: This position increment is really 1, but the test passes!!!! */
{code}

All the tests similar to the above test that position increments are the same as the original token.
But this is not actually what is happening, in fact synonymfilter is generating 'aa' with a position increment of 1.

There are several examples like this in SOLR-1674. What I did was write the assertions so that the tests pass, revealing the true behavior.
But I put comments above these assertions that say what i thought the existing test was trying to show, i.e. the desired behavior.


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792928#action_12792928 ] 

Koji Sekiguchi commented on SOLR-1670:
--------------------------------------

Robert, sorry, I wanted to say I agree with you regarding "the test for 'repeats' has a flaw". Then "boost TF" was just an input, though I don't know it is intentional feature or side effect.

Why don't you fix the flaws in SynonymFilter test in this ticket first, then fix SOLR-1674? (I've not looked into SOLR-1674 yet.)

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792922#action_12792922 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

Hi Koji, I don't really understand what you are saying here?

If it is a feature of synonymfilter that it should repeat to boost TF, then the test should have looked like this:

assertTokEqual(getTokList(map,"a b",false), tokens("ab ab ab"));

but it does not, the test looks like this:
assertTokEqual(getTokList(map,"a b",false), tokens("ab"));



> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley reassigned SOLR-1670:
----------------------------------

    Assignee: Yonik Seeley

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829079#action_12829079 ] 

Steven Rowe commented on SOLR-1670:
-----------------------------------

bq. is it possible to expose the 'unsorted' one to synonyms test (such as in the synonyms test file itself, rather than base token stream test case?)

I wrote it the way I did to reduce code duplication/maintenance effort, e.g. the upcoming sync with Lucene's changes in this area.

I'm thinking it should be possible to refactor the current method to serve both the sorted case and an external unsorted case.  I'll work on it.

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated SOLR-1670:
------------------------------

    Attachment: SOLR-1670.patch

New version of the patch, introducing the possibility to test token streams that are allowed to vary the order of their overlapping tokens.  All tests pass for me.

Yonik wrote:
bq. The description of the synonym filter never specifies the order of overlapping tokens, thus the tests should accept either.

Yonik, can you check that the attached patch does what you are suggesting?

Robert wrote:
{quote}
I guess in my opinion, overall its better for the tests to be overly strict, and if in the future we make a valid change to the implementation that breaks a test, we can discuss it during said change, and people can comment on whether this behavior was actually important: for example the aa/a versus a/aa i would probably say not a big deal, but the aa versus aa/aa/aa thing to me is a big deal.

The alternative, being overly lax, presents the possibility of introducing incorrect behavior without being caught, and I think this is very dangerous.
{quote}

Robert, do you think that the attached patch crosses over into overly lax land?

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792930#action_12792930 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

bq. Why don't you fix the flaws in SynonymFilter test in this ticket first, then fix SOLR-1674? (I've not looked into SOLR-1674 yet.)

Because how would you know I properly fixed it? In fact, if anyone else submitted a patch to fix this issue, I don't really care how many tests they write, I would post a comment voting against the patch, because the tests mean nothing.

The analysis tests themselves (things like this assertTokEqual) are broken, so SOLR-1674 fixes the analysis tests so they actually work.

I think we have to make the tests work correctly, before we can think of fixing nontrivial bugs?


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793529#action_12793529 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

Yonik, thanks, I am glad it is not a serious bug!

bq. I don't think assertTokEqual really has a bug - it's written more to match lucene queries and indexes, not to exactly compare one token stream with another.

I see your point about assertTokEqual, but it was being used to test tokenstreams... so I am glad to see it go. 

Mainly want to make sure we don't break anything trying to move this stuff to the new tokenstream API, I am sure this will involve more tests, but for now having well-defined behavior makes it a lot easier. Thanks again

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793213#action_12793213 ] 

Yonik Seeley commented on SOLR-1670:
------------------------------------

Yep, must be a bug in the merging code, never caught because of a bug in assertTokEqual.  I'll look into it.

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829097#action_12829097 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

bq. Not at the semantic level (for overlapping tokens).

Another way to look at it is that a tokenstream is just a sequence of tokens, and posInc is just another attribute.

your description of semantics makes sense in terms of how it is used by the indexer, but the order of these tokens can matter if someone uses a custom tokenfilter, it might matter for some custom attributes, and it might matter for a different consumer, its different behavior. i have made an effort to preserve all the behavior of all these  tokenstreams when converting to the new api. I really don't want to break anything.


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792456#action_12792456 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

note: i have found more bugs in this filter by reworking the tests.

it is impossible to fix the real bugs in the synonyms filter the way the tests are written now, as they do not compare correctly.
I will upload the patch to SOLR-1657 fix all the analysis tests first, so then we have tests that work and can actually start debugging this.
in the comments i will put what the old test "thought" it was testing as a FIXME


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829088#action_12829088 ] 

Yonik Seeley commented on SOLR-1670:
------------------------------------

bq. is it possible to only expose the 'unsorted' one to synonyms test (such as in the synonyms test file itself, rather than base token stream test case?)

Order of overlapping tokens is unimportant in every TokenFilter used in Solr that I know about. Order-sensitivity is the exception, no?

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793684#action_12793684 ] 

Yonik Seeley commented on SOLR-1670:
------------------------------------

Duplicated tokens aside (that should have been tested another way), I prefer to test the intended semantics rather than the exact behavior (which often ends up testing the exact implementation and ends up being too fragile... valid changes to the implementation break the tests).

The description of the synonym filter never specifies the order of overlapping tokens, thus the tests should accept either.  But as a practical matter, I don't really care about the restrictive token ordering since it's localized to a single method that can be changed as needed later if the tests break.

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829094#action_12829094 ] 

Yonik Seeley commented on SOLR-1670:
------------------------------------

bq. I guess all along my problem is that tokenstreams are ordered by definition. 

Not at the semantic level (for overlapping tokens).  For instance, it would be incorrect for someone to rely on a specific ordering for overlapping tokens - order has never been guaranteed by any TokenFilters that do produce overlapping tokens, and in fact the ordering for some has changed in the past as the implementation changed.


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829399#action_12829399 ] 

Lance Norskog commented on SOLR-1670:
-------------------------------------

bq. Not at the semantic level (for overlapping tokens). For instance, it would be incorrect for someone to rely on a specific ordering for overlapping tokens - order has never been guaranteed by any TokenFilters that do produce overlapping tokens, and in fact the ordering for some has changed in the past as the implementation changed.

Is this part of the TokenFilter contract documented somewhere?

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828077#action_12828077 ] 

Steven Rowe commented on SOLR-1670:
-----------------------------------

bq. We need to be really careful about allowing any order, it does matter in some situations.

I left in place the existing test method, which requires the specified order.

{quote}
only slightly related: we added some improvements to this assertion in lucene recently and found a lot of bugs, better checking for clearAttribute() and end()
at some I would like to port these test improvements over to solr, too. 
{quote}

+1

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1670:
------------------------------

    Attachment: SOLR-1670_test.patch

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792980#action_12792980 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

bq. Robert, can you tell if the flaw is in the test code or the SynonymFilter? I'm pretty sure that such a bug (in the filter) wasn't originally there... so my money would be on the test - perhaps getTokList, but I haven't had a chance to look yet.

I am pretty sure there is a problem. the rewritten test in SOLR-1674 looks like: 

{code}
assertTokenizesTo(map, "a b", new String[] { "ab", "ab", "ab"  });
{code}

where assertTokenizesTo is just

{code}
static void assertTokenizesTo(SynonymMap dict, String input,
      String expected[]) throws IOException {
    Tokenizer tokenizer = new WhitespaceTokenizer(new StringReader(input));
    SynonymFilter stream = new SynonymFilter(tokenizer, dict);
    assertTokenStreamContents(stream, expected);
  }
{code}

and assertTokenStreamContents is the one copied from lucene.


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>         Attachments: SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793690#action_12793690 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

bq. Duplicated tokens aside (that should have been tested another way), I prefer to test the intended semantics rather than the exact behavior

then it would be better to use actual queries and such to test this filter, like (some of the) WordDelimiterFilter tests do with assertQ and friends.

But I think things like the order of tokens in the stream is still something i would like to preserve when migrating to the new tokenstream API. Maybe it doesn't seem important but it would still be a change in behavior (probably only break you if you ran positionfilter after, or something like that).

I guess in my opinion, overall its better for the tests to be overly strict, and if in the future we make a valid change to the implementation that breaks a test, we can discuss it during said change, and people can comment on whether this behavior was actually important: for example the aa/a versus a/aa i would probably say not a big deal, but the aa versus aa/aa/aa thing to me is a big deal.

The alternative, being overly lax, presents the possibility of introducing incorrect behavior without being caught, and I think this is very dangerous.


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated SOLR-1670:
------------------------------

    Attachment: SOLR-1670.patch

I like Yonik's "overlapping tokens" term better than the term I used in my patch ("collocated tokens" - ambiguous, since this is used to refer to phrases rather than shared-position tokens), so I've replaced "collocated" with "overlapping" in variable names in this version of the patch.

Also, I had forgotten to test for the case where more token stream tokens are present at the same position than are expected.  This is now fixed.

I have not made any other changes - in particular, I have not moved the overlapping-token-order-insensitive test capability out of BaseTokenTestCase.

All tests pass for me.

> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829092#action_12829092 ] 

Robert Muir commented on SOLR-1670:
-----------------------------------

bq. Order of overlapping tokens is unimportant in every TokenFilter used in Solr that I know about. Order-sensitivity is the exception, no?

I guess all along my problem is that tokenstreams are ordered by definition. if this order does not matter, a test that uses actual queries would make more sense.

The problem was that the test constructions previously used by this filter were used in other places where they really shouldn't have been, and the laxness hid real bugs (such as this very issue itself!!!).
This is all I am trying to avoid. There is nothing wrong with Steven's patch/test construction, just trying to err on the side of caution.


> synonymfilter/map repeat bug
> ----------------------------
>
>                 Key: SOLR-1670
>                 URL: https://issues.apache.org/jira/browse/SOLR-1670
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Robert Muir
>            Assignee: Yonik Seeley
>         Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch
>
>
> as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter
> the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of token are equal, it just stops at the shorted one.
> {code}
>     // repeats
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     map.add(strings("a b"), tokens("ab"), orig, merge);
>     assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
>     /* in reality the result from getTokList is ab ab ab!!!!! */
> {code}
> when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.