Posted to solr-dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/01/06 23:12:54 UTC

[jira] Created: (SOLR-1706) wrong tokens output from WordDelimiterFilter when english possessives are in the text

wrong tokens output from WordDelimiterFilter when english possessives are in the text
-------------------------------------------------------------------------------------

                 Key: SOLR-1706
                 URL: https://issues.apache.org/jira/browse/SOLR-1706
             Project: Solr
          Issue Type: Bug
          Components: Schema and Analysis
    Affects Versions: 1.4
            Reporter: Robert Muir


The WordDelimiterFilter English possessive stemming (removal of a trailing "'s", on by default) unfortunately causes strange behavior.

Below you can see that when I request only numeric concatenations (not words), these English possessive stems are still sometimes output, ignoring the options I provided, and even then in a very inconsistent way.

{code}
  assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42", "AutoCoder" },
    new int[] { 18, 21 },
    new int[] { 20, 30 },
    new int[] { 1, 1 });

  assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42", "AutoCoder", "56" },
    new int[] { 18, 21, 33 },
    new int[] { 20, 30, 35 },
    new int[] { 1, 1, 1 });

  assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
    new String[] {  },
    new int[] {  },
    new int[] {  },
    new int[] {  });

  assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42" },
    new int[] { 18 },
    new int[] { 20 },
    new int[] { 1 });
{code}

where assertWdf is 
{code}
  // The assertions above pass no token types, so the helper takes no types
  // argument and passes null for it (types are then not checked).
  void assertWdf(String text, int generateWordParts, int generateNumberParts,
      int catenateWords, int catenateNumbers, int catenateAll,
      int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
      int stemEnglishPossessive, CharArraySet protWords, String expected[],
      int startOffsets[], int endOffsets[], int posIncs[])
      throws IOException {
    TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
    WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
        generateNumberParts, catenateWords, catenateNumbers, catenateAll,
        splitOnCaseChange, preserveOriginal, splitOnNumerics,
        stemEnglishPossessive, protWords);
    assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, null,
        posIncs);
  }
{code}
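
For readers checking the expected values by hand: each start/end offset pair in the assertions above reads as a half-open character range into the input string. The following is a minimal sketch in plain Java with no Lucene dependency. The class and method names (OffsetSlices, slices) are made up for illustration and are not part of the Solr test suite.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr: slices an input string by token offsets.
final class OffsetSlices {
  // Returns text[start, end) for each offset pair (end offset is exclusive).
  static List<String> slices(String text, int[] starts, int[] ends) {
    if (starts.length != ends.length) {
      throw new IllegalArgumentException("offset arrays differ in length");
    }
    List<String> out = new ArrayList<>();
    for (int i = 0; i < starts.length; i++) {
      out.add(text.substring(starts[i], ends[i]));
    }
    return out;
  }

  public static void main(String[] args) {
    // Offsets copied from the first assertion in the report.
    String text = "Super-Duper-XL500-42-AutoCoder's";
    System.out.println(slices(text, new int[] { 18, 21 }, new int[] { 20, 30 }));
    // prints [42, AutoCoder]
  }
}
```

The second slice covers "AutoCoder" without the trailing "'s", which is the word token the report says is emitted even though only catenateNumbers and stemEnglishPossessive are set.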


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1706) wrong tokens output from WordDelimiterFilter when english possessives are in the text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797366#action_12797366 ] 

Robert Muir commented on SOLR-1706:
-----------------------------------

By the way, I do not have a patch here. I am putting the finishing touches on converting this TokenStream to the new TokenStream API, so one alternative is to fix it under SOLR-1657.

The problem is that I am autogenerating many test cases for all 512 combinations of the 9 boolean options across various strings, and seeing things like this.

So at the least I would like agreement that this is buggy behavior. If someone knows how to fix the existing code, that would be even better; it would make testing easier for me.
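
As an illustration of how 9 boolean options yield 512 combinations, here is a minimal sketch in plain Java. It is not the generator used for this issue. The bit order is an inference from the test names in this thread: test0, test32 and test128 carry flag lists that match the 9-bit binary form of the test number, most significant bit first. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerates every combination of the 9 int flags.
final class FlagCombos {
  static final int NUM_FLAGS = 9;

  // Flag i (in assertWdf parameter order) is bit (8 - i) of n.
  // This ordering is inferred from the test names, not confirmed by the source.
  static int[] flagsFor(int n) {
    int[] flags = new int[NUM_FLAGS];
    for (int i = 0; i < NUM_FLAGS; i++) {
      flags[i] = (n >> (NUM_FLAGS - 1 - i)) & 1;
    }
    return flags;
  }

  // Formats flags the way they appear in the assertWdf calls, e.g. "0,0,0,1,0,0,0,0,0".
  static String joined(int[] flags) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < flags.length; i++) {
      if (i > 0) sb.append(',');
      sb.append(flags[i]);
    }
    return sb.toString();
  }

  // One entry per combination, indexed by test number 0..511.
  static List<String> allCombos() {
    List<String> out = new ArrayList<>();
    for (int n = 0; n < (1 << NUM_FLAGS); n++) {
      out.add(joined(flagsFor(n)));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(allCombos().size());                  // prints 512
    System.out.println("test32:  " + joined(flagsFor(32)));  // 0,0,0,1,0,0,0,0,0
    System.out.println("test128: " + joined(flagsFor(128))); // 0,1,0,0,0,0,0,0,0
  }
}
```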




[jira] Commented: (SOLR-1706) wrong tokens output from WordDelimiterFilter when english possessives are in the text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797829#action_12797829 ] 

Robert Muir commented on SOLR-1706:
-----------------------------------

It's not just the concatenation, but also the subword generation.

In the case below, Autocoder should not be emitted, as only numeric subword generation is turned on.

{code}
  public void test128() throws Exception {
    assertWdf("word 1234 Super-Duper-XL500-42-Autocoder x'sbd123 a4b3c-", 0,1,0,0,0,0,0,0,0, null,
      new String[] { "word", "1234", "42", "Autocoder", "a4b3c" },
      new int[] { 0, 5, 28, 31, 50 },
      new int[] { 4, 9, 30, 40, 55 },
      new int[] { 1, 1, 1, 1, 2 });
  }
{code}
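
The nine positional flags are easy to misread, so here is a small sketch in plain Java that maps a flag list to option names, using the parameter order of the assertWdf helper quoted in the issue description. The class and method names are hypothetical and the code does not call WordDelimiterFilter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names the options that are switched on in a flag list.
final class WdfFlagNames {
  // Parameter order taken from the assertWdf signature in the issue description.
  static final String[] NAMES = {
    "generateWordParts", "generateNumberParts", "catenateWords",
    "catenateNumbers", "catenateAll", "splitOnCaseChange",
    "preserveOriginal", "splitOnNumerics", "stemEnglishPossessive"
  };

  // Returns the names of all flags whose value is non-zero, in parameter order.
  static List<String> enabled(int... flags) {
    if (flags.length != NAMES.length) {
      throw new IllegalArgumentException("expected " + NAMES.length + " flags");
    }
    List<String> on = new ArrayList<>();
    for (int i = 0; i < flags.length; i++) {
      if (flags[i] != 0) {
        on.add(NAMES[i]);
      }
    }
    return on;
  }

  public static void main(String[] args) {
    // Flags from test128 above.
    System.out.println(enabled(0, 1, 0, 0, 0, 0, 0, 0, 0));
    // prints [generateNumberParts]

    // Flags from the assertions in the issue description.
    System.out.println(enabled(0, 0, 0, 1, 0, 0, 0, 0, 1));
    // prints [catenateNumbers, stemEnglishPossessive]
  }
}
```

With only generateNumberParts enabled, the word token "Autocoder" in the expected output of test128 is the part that should not be there.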



[jira] Commented: (SOLR-1706) wrong tokens output from WordDelimiterFilter depending upon options

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798251#action_12798251 ] 

Yonik Seeley commented on SOLR-1706:
------------------------------------

Yep, certainly bugs.  IMO, no need to worry about trying to match (even for compat) - these look like real configuration edge cases to me.



[jira] Updated: (SOLR-1706) wrong tokens output from WordDelimiterFilter depending upon options

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1706:
------------------------------

    Description: 
Below you can see that when I request only numeric concatenations (not words), some words are still sometimes output, ignoring the options I provided, and even then in a very inconsistent way.

{code}
  assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42", "AutoCoder" },
    new int[] { 18, 21 },
    new int[] { 20, 30 },
    new int[] { 1, 1 });

  assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42", "AutoCoder", "56" },
    new int[] { 18, 21, 33 },
    new int[] { 20, 30, 35 },
    new int[] { 1, 1, 1 });

  assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
    new String[] {  },
    new int[] {  },
    new int[] {  },
    new int[] {  });

  assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42" },
    new int[] { 18 },
    new int[] { 20 },
    new int[] { 1 });
{code}

where assertWdf is 
{code}
  // The assertions above pass no token types, so the helper takes no types
  // argument and passes null for it (types are then not checked).
  void assertWdf(String text, int generateWordParts, int generateNumberParts,
      int catenateWords, int catenateNumbers, int catenateAll,
      int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
      int stemEnglishPossessive, CharArraySet protWords, String expected[],
      int startOffsets[], int endOffsets[], int posIncs[])
      throws IOException {
    TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
    WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
        generateNumberParts, catenateWords, catenateNumbers, catenateAll,
        splitOnCaseChange, preserveOriginal, splitOnNumerics,
        stemEnglishPossessive, protWords);
    assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, null,
        posIncs);
  }
{code}




        Summary: wrong tokens output from WordDelimiterFilter depending upon options  (was: wrong tokens output from WordDelimiterFilter when english possessives are in the text)



[jira] Resolved: (SOLR-1706) wrong tokens output from WordDelimiterFilter depending upon options

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved SOLR-1706.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 3.1
         Assignee: Mark Miller

This was resolved in revision 922957.



[jira] Commented: (SOLR-1706) wrong tokens output from WordDelimiterFilter when english possessives are in the text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797466#action_12797466 ] 

Robert Muir commented on SOLR-1706:
-----------------------------------

OK, I narrowed this one down some. It appears to be completely unrelated to possessives and is instead some other off-by-one bug:

{code}
public void test0() throws Exception {
  assertWdf("1-a-2 3-b-c-4 5-d-e 6-f", 0,0,0,0,0,0,0,0,0, null,
    new String[] {  },
    new int[] {  },
    new int[] {  },
    new int[] {  });
}

public void test32() throws Exception {
  assertWdf("1-a-2 3-b-c-4 5-d-e 6-f", 0,0,0,1,0,0,0,0,0, null,
    new String[] { "1", "a", "2", "3", "4", "5", "6", "f" },
    new int[] { 0, 2, 4, 6, 12, 14, 20, 22 },
    new int[] { 1, 3, 5, 7, 13, 15, 21, 23 },
    new int[] { 1, 1, 1, 1, 1, 1, 1, 1 });
}
{code}
