Posted to solr-dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2010/01/09 00:34:54 UTC
[jira] Commented: (SOLR-1706) wrong tokens output from WordDelimiterFilter depending upon options
[ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798251#action_12798251 ]
Yonik Seeley commented on SOLR-1706:
------------------------------------
Yep, certainly bugs. IMO, there is no need to worry about trying to match the existing output (even for compat); these look like real configuration edge cases to me.
> wrong tokens output from WordDelimiterFilter depending upon options
> -------------------------------------------------------------------
>
> Key: SOLR-1706
> URL: https://issues.apache.org/jira/browse/SOLR-1706
> Project: Solr
> Issue Type: Bug
> Components: Schema and Analysis
> Affects Versions: 1.4
> Reporter: Robert Muir
>
> Below you can see that when I request only numeric concatenations (not words), some words are still output, ignoring the options I have provided, and even then the output is very inconsistent.
> {code}
> assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
>     new String[] { "42", "AutoCoder" },
>     new int[] { 18, 21 },
>     new int[] { 20, 30 },
>     new int[] { 1, 1 });
> assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
>     new String[] { "42", "AutoCoder", "56" },
>     new int[] { 18, 21, 33 },
>     new int[] { 20, 30, 35 },
>     new int[] { 1, 1, 1 });
> assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
>     new String[] { },
>     new int[] { },
>     new int[] { },
>     new int[] { });
> assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
>     new String[] { "42" },
>     new int[] { 18 },
>     new int[] { 20 },
>     new int[] { 1 });
> {code}
> where assertWdf is
> {code}
> // The int flags are passed to the WordDelimiterFilter constructor in the order given here.
> void assertWdf(String text, int generateWordParts, int generateNumberParts,
>     int catenateWords, int catenateNumbers, int catenateAll,
>     int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
>     int stemEnglishPossessive, CharArraySet protWords, String expected[],
>     int startOffsets[], int endOffsets[], int posIncs[])
>     throws IOException {
>   TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
>   WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
>       generateNumberParts, catenateWords, catenateNumbers, catenateAll,
>       splitOnCaseChange, preserveOriginal, splitOnNumerics,
>       stemEnglishPossessive, protWords);
>   // No token types are checked, matching the four-array calls above.
>   assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, posIncs);
> }
> {code}
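For comparison, here is a rough sketch of what "only numeric concatenations" might mean for the four inputs above. It is not WordDelimiterFilter's algorithm: it assumes a simplified model that splits on '-' only, treats a part as numeric only when every character is an ASCII digit, and joins adjacent numeric parts. The class and method names (CatenateNumbersSketch, numericConcatenations) are hypothetical, and the issue itself does not state the intended output.

```java
import java.util.ArrayList;
import java.util.List;

public class CatenateNumbersSketch {

    /** True if the part is non-empty and consists only of ASCII digits. */
    static boolean isNumeric(String part) {
        if (part.isEmpty()) {
            return false;
        }
        for (int i = 0; i < part.length(); i++) {
            char c = part.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    /**
     * Simplified model (an assumption, not the real filter): split on '-'
     * and emit one token per run of adjacent all-digit parts. Word parts
     * and mixed parts such as "XL500" end a run and produce no output.
     */
    static List<String> numericConcatenations(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String part : text.split("-")) {
            if (isNumeric(part)) {
                run.append(part);
            } else if (run.length() > 0) {
                out.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "Super-Duper-XL500-42-AutoCoder's",
            "Super-Duper-XL500-42-AutoCoder's-56",
            "Super-Duper-XL500-AB-AutoCoder's",
            "Super-Duper-XL500-42-AutoCoder's-BC"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + numericConcatenations(input));
        }
    }
}
```

Under this reading, no word token such as "AutoCoder" would appear for any of the four inputs, whereas the assertions quoted above include it for the first two inputs but not for the others.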