Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2014/03/19 06:40:44 UTC

[jira] [Updated] (LUCENE-4641) Fix analyzer bugs documented in TestRandomChains

     [ https://issues.apache.org/jira/browse/LUCENE-4641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4641:
--------------------------------

    Attachment: LUCENE-4641_tests.patch

Unfortunately, the situation looks better than it really is.

Somehow, in the test framework, we conflated 'broken offsets' with 'doesn't respect position length/graphs'. The former is serious (it causes users exceptions today, e.g. when highlighting); the latter not so much.

This castrated TestRandomChains, because it was never really checking the stuff we care about today. I fixed that in this patch, and things are angry: it seems to fail about 20% of the time. So there is more work to do.
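
To make the distinction concrete, the checks that matter here are roughly the following offset invariants over a token stream. This is only a minimal sketch using the standard attribute API; the class, method, and field names are made up for illustration, and it is not the actual BaseTokenStreamTestCase/ValidatingTokenFilter code:

{noformat}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetInvariantSketch {
  /** Walks a token stream produced by the given analyzer and checks basic offset invariants. */
  static void checkOffsets(Analyzer analyzer, String field, String text) throws IOException {
    TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
    try {
      OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
      ts.reset();
      int lastStart = 0;
      while (ts.incrementToken()) {
        int start = offsetAtt.startOffset();
        int end = offsetAtt.endOffset();
        // the 'broken offsets' class of bug: these are the violations that
        // later cause exceptions in consumers such as highlighting
        if (start > end) {
          throw new IllegalStateException("startOffset " + start + " > endOffset " + end);
        }
        if (start < lastStart) {
          throw new IllegalStateException("offsets went backwards: " + start + " after " + lastStart);
        }
        lastStart = start;
      }
      ts.end();
    } finally {
      ts.close();
    }
  }
}
{noformat}

Position length/graph problems, by contrast, show up in the position increment/length attributes and typically don't cause exceptions for consumers, which is why they are the less urgent half of what the old test was conflating.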


> Fix analyzer bugs documented in TestRandomChains
> ------------------------------------------------
>
>                 Key: LUCENE-4641
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4641
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>         Attachments: LUCENE-4641_tests.patch
>
>
> TestRandomChains.java found a lot of bugs, some of which are hard to fix. So we blacklisted certain analysis components from the test.
> But we really need to fix these: some of these bugs are bad, and they impact users, e.g. with highlighting (SOLR-4137 and so on):
> {noformat}
>   // TODO: fix those and remove
>   private static final Set<Class<?>> brokenComponents = Collections.newSetFromMap(new IdentityHashMap<Class<?>,Boolean>());
>   static {
>     // TODO: can we promote some of these to be only
>     // offsets offenders?
>     Collections.<Class<?>>addAll(brokenComponents,
>       // TODO: fix BaseTokenStreamTestCase not to trip because this one has no CharTermAtt
>       EmptyTokenizer.class,
>       // doesn't actually reset itself!
>       CachingTokenFilter.class,
>       // doesn't consume whole stream!
>       LimitTokenCountFilter.class,
>       // Not broken: we forcefully add this, so we shouldn't
>       // also randomly pick it:
>       ValidatingTokenFilter.class,
>       // NOTE: these by themselves won't cause any 'basic assertions' to fail.
>       // but see https://issues.apache.org/jira/browse/LUCENE-3920, if any 
>       // tokenfilter that combines words (e.g. shingles) comes after them,
>       // this will create bogus offsets because their 'offsets go backwards',
>       // causing shingle or whatever to make a single token with a 
>       // startOffset that's > its endOffset
>       // (see LUCENE-3738 for a list of other offenders here)
>       // broken!
>       NGramTokenizer.class,
>       // broken!
>       NGramTokenFilter.class,
>       // broken!
>       EdgeNGramTokenizer.class,
>       // broken!
>       EdgeNGramTokenFilter.class,
>       // broken!
>       WordDelimiterFilter.class,
>       // broken!
>       TrimFilter.class
>     );
>   }
>   // TODO: also fix these and remove (maybe):
>   // Classes that don't produce consistent graph offsets:
>   private static final Set<Class<?>> brokenOffsetsComponents = Collections.newSetFromMap(new IdentityHashMap<Class<?>,Boolean>());
>   static {
>     Collections.<Class<?>>addAll(brokenOffsetsComponents,
>       ReversePathHierarchyTokenizer.class,
>       PathHierarchyTokenizer.class,
>       HyphenationCompoundWordTokenFilter.class,
>       DictionaryCompoundWordTokenFilter.class,
>       // TODO: corrupts graphs (offset consistency check):
>       PositionFilter.class,
>       // TODO: it seems to mess up offsets!?
>       WikipediaTokenizer.class,
>       // TODO: doesn't handle graph inputs
>       ThaiWordFilter.class,
>       // TODO: doesn't handle graph inputs
>       CJKBigramFilter.class,
>       // TODO: doesn't handle graph inputs (or even look at positionIncrement)
>       HyphenatedWordsFilter.class,
>       // LUCENE-4065: only if you pass 'false' to enablePositionIncrements!
>       TypeTokenFilter.class,
>       // TODO: doesn't handle graph inputs
>       CommonGramsQueryFilter.class
>     );
>   }
> {noformat}
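
To make the 'offsets go backwards' comment in the listing above concrete, here is a tiny self-contained illustration (with made-up offset numbers, not ShingleFilter's actual code) of why a token-combining filter downstream of such a component ends up emitting startOffset > endOffset:

{noformat}
public class BackwardsOffsetsSketch {
  public static void main(String[] args) {
    // Hypothetical offsets for illustration only. A broken upstream component
    // emits tokens whose offsets go backwards:
    //   token "foo" with offsets [5, 8]
    //   token "ba"  with offsets [2, 4]  <-- starts before the previous token
    int[] firstToken  = {5, 8};
    int[] secondToken = {2, 4};
    // A combining filter (shingles, compound words, ...) typically spans from
    // the startOffset of its first part to the endOffset of its last part:
    int combinedStart = firstToken[0];   // 5
    int combinedEnd   = secondToken[1];  // 4
    System.out.println("combined token span: [" + combinedStart + ", " + combinedEnd + "]");
    // combinedStart (5) > combinedEnd (4): an impossible span, which is exactly
    // the kind of token that later blows up consumers such as the highlighter.
  }
}
{noformat}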



