Posted to dev@lucene.apache.org by "Alan Woodward (JIRA)" <ji...@apache.org> on 2018/06/07 07:59:00 UTC
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

    [ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504369#comment-16504369 ] 

Alan Woodward commented on LUCENE-8273:
---------------------------------------

A couple more failing seeds:
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
07:41:21    [junit4]   2> TEST FAIL: useCharFilter=true text='t \u0af5\u0a9f\u0acb\u0ada\u0aa6 \u0011\u02eb^ q hnhpwei txslx  e \u22c8\u22d9 \u2e06\u2e15\u2e6a\u2e05 uv im i \u1387\u1391\u1398\u1386\u138c  j'
07:41:21    [junit4]   2> Exception from random analyzer: 
07:41:21    [junit4]   2> charfilters=
07:41:21    [junit4]   2>   org.apache.lucene.analysis.MockCharFilter(java.io.StringReader@495f18ce)
07:41:21    [junit4]   2> tokenizer=
07:41:21    [junit4]   2>   org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@37e10425)
07:41:21    [junit4]   2> filters=ConditionalTokenFilter: 
07:41:21    [junit4]   2>   org.apache.lucene.analysis.MockGraphTokenFilter(java.util.Random@4d3a58d5, OneTimeWrapper@7c12a0d4 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)ConditionalTokenFilter: 
07:41:21    [junit4]   2>   org.apache.lucene.analysis.shingle.ShingleFilter(OneTimeWrapper@75d26c0e term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, <KATAKANA>)
07:41:21    [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=A61F0C126076A16B -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=ar-TN -Dtests.timezone=US/Arizona -Dtests.asserts=true -Dtests.file.encoding=UTF-8
07:41:21    [junit4] ERROR   0.55s J2 | TestRandomChains.testRandomChainsWithLargeStrings <<<
07:41:21    [junit4]    > Throwable #1: java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=10: 41 vs 55; token=L寂,釟'Ƈ⨄{𐦠֍
{code}
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
15:11:36    [junit4]   2> TEST FAIL: useCharFilter=true text='\u2d8e\u2dbb\u2daf\u2d8b\u2d97\u2dd5\u2d97\u2dcc \u035b\u5996\u07ca\u0003\u12e3\u6450\uf36f '
15:11:36    [junit4]   2> Exception from random analyzer: 
15:11:36    [junit4]   2> charfilters=
15:11:36    [junit4]   2> tokenizer=
15:11:36    [junit4]   2>   org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer()
15:11:36    [junit4]   2> filters=
15:11:36    [junit4]   2>   org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@3f23768a term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, lbay)ConditionalTokenFilter: 
15:11:36    [junit4]   2>   org.apache.lucene.analysis.SimplePayloadFilter(OneTimeWrapper@114147eb term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,payload=null)ConditionalTokenFilter: 
15:11:36    [junit4]   2>   org.apache.lucene.analysis.shingle.ShingleFilter(OneTimeWrapper@680fa1a5 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,payload=null, 7)
15:11:36    [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=49B6EB935C8F8B35 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=nl-NL -Dtests.timezone=BST -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
15:11:36    [junit4] ERROR   0.05s J2 | TestRandomChains.testRandomChains <<<
15:11:36    [junit4]    > Throwable #1: java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=9: 12 vs 14; token=ߊ ዣዣ
{code}
I'm looking at these now.
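
For context, both exceptions above say that two tokens ending at the same position reported different end offsets. A rough, dependency-free sketch of that kind of check follows. It assumes the validator remembers the first end offset seen for each end position; OffsetCheckSketch, Token, and firstInconsistency are hypothetical names for illustration, not Lucene classes or methods.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an end-offset consistency check; not Lucene code.
final class OffsetCheckSketch {

    // Minimal token: position increment, position length, and end offset.
    static final class Token {
        final int posInc;
        final int posLen;
        final int endOffset;

        Token(int posInc, int posLen, int endOffset) {
            this.posInc = posInc;
            this.posLen = posLen;
            this.endOffset = endOffset;
        }
    }

    // Returns a description of the first inconsistency, or null if there is none.
    static String firstInconsistency(Token[] tokens) {
        // Assumption: the first end offset seen for an end position is the reference value.
        Map<Integer, Integer> endOffsetByPos = new HashMap<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;               // position where this token starts
            int endPos = pos + t.posLen;   // position where this token ends
            Integer seen = endOffsetByPos.putIfAbsent(endPos, t.endOffset);
            if (seen != null && seen != t.endOffset) {
                return "inconsistent endOffset at pos=" + endPos + ": " + seen + " vs " + t.endOffset;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Token[] tokens = {
            new Token(1, 1, 5),   // first unigram
            new Token(0, 2, 11),  // two-position token starting at the same position
            new Token(1, 1, 9)    // second unigram, ends where the token above ends
        };
        System.out.println(firstInconsistency(tokens));
    }
}
```

In this made-up input the two-position token and the second unigram end at the same position with different end offsets, which is the shape of the failure reported by both seeds.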

> Add a ConditionalTokenFilter
> ----------------------------
>
>                 Key: LUCENE-8273
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8273
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter in such a way that it could optionally be bypassed based on the current state of the TokenStream.  This could be used to, for example, only apply WordDelimiterFilter to terms that contain hyphens.
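
The bypass idea in the description can be sketched without any Lucene dependency. Everything in this sketch is hypothetical: ConditionalSketch, applyIf, and splitOnHyphens stand in for the proposed wrapper and for a WordDelimiterFilter-like filter, and none of it is the API from the attached patches. It also works on a plain list of terms, whereas the real filter would have to decide per token on a streaming TokenStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical, dependency-free sketch of a conditionally applied filter.
final class ConditionalSketch {

    // Applies `filter` only to tokens matching `condition`; all other tokens pass through unchanged.
    static List<String> applyIf(List<String> tokens,
                                Predicate<String> condition,
                                Function<String, List<String>> filter) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (condition.test(token)) {
                out.addAll(filter.apply(token));
            } else {
                out.add(token);
            }
        }
        return out;
    }

    // Stand-in for a word-delimiter style filter: splits a term on hyphens.
    static List<String> splitOnHyphens(String term) {
        List<String> parts = new ArrayList<>();
        for (String part : term.split("-")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("wi-fi", "router", "plug-and-play");
        // Only terms containing a hyphen reach the splitting filter.
        System.out.println(applyIf(tokens, t -> t.contains("-"), ConditionalSketch::splitOnHyphens));
    }
}
```

The sketch ignores offsets and position lengths entirely, which is where the failing seeds in the comment above show the real difficulty.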


