Posted to dev@lucene.apache.org by "Alan Woodward (JIRA)" <ji...@apache.org> on 2018/06/07 07:59:00 UTC
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504369#comment-16504369 ]
Alan Woodward commented on LUCENE-8273:
---------------------------------------
A couple more failing seeds:
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
07:41:21 [junit4] 2> TEST FAIL: useCharFilter=true text='t \u0af5\u0a9f\u0acb\u0ada\u0aa6 \u0011\u02eb^ q hnhpwei txslx e \u22c8\u22d9 \u2e06\u2e15\u2e6a\u2e05 uv im i \u1387\u1391\u1398\u1386\u138c j'
07:41:21 [junit4] 2> Exception from random analyzer:
07:41:21 [junit4] 2> charfilters=
07:41:21 [junit4] 2> org.apache.lucene.analysis.MockCharFilter(java.io.StringReader@495f18ce)
07:41:21 [junit4] 2> tokenizer=
07:41:21 [junit4] 2> org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@37e10425)
07:41:21 [junit4] 2> filters=ConditionalTokenFilter:
07:41:21 [junit4] 2> org.apache.lucene.analysis.MockGraphTokenFilter(java.util.Random@4d3a58d5, OneTimeWrapper@7c12a0d4 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)ConditionalTokenFilter:
07:41:21 [junit4] 2> org.apache.lucene.analysis.shingle.ShingleFilter(OneTimeWrapper@75d26c0e term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, <KATAKANA>)
07:41:21 [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=A61F0C126076A16B -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=ar-TN -Dtests.timezone=US/Arizona -Dtests.asserts=true -Dtests.file.encoding=UTF-8
07:41:21 [junit4] ERROR 0.55s J2 | TestRandomChains.testRandomChainsWithLargeStrings <<<
07:41:21 [junit4] > Throwable #1: java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=10: 41 vs 55; token=L寂,釟'Ƈ⨄{𐦠֍
{code}
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
15:11:36 [junit4] 2> TEST FAIL: useCharFilter=true text='\u2d8e\u2dbb\u2daf\u2d8b\u2d97\u2dd5\u2d97\u2dcc \u035b\u5996\u07ca\u0003\u12e3\u6450\uf36f '
15:11:36 [junit4] 2> Exception from random analyzer:
15:11:36 [junit4] 2> charfilters=
15:11:36 [junit4] 2> tokenizer=
15:11:36 [junit4] 2> org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer()
15:11:36 [junit4] 2> filters=
15:11:36 [junit4] 2> org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@3f23768a term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, lbay)ConditionalTokenFilter:
15:11:36 [junit4] 2> org.apache.lucene.analysis.SimplePayloadFilter(OneTimeWrapper@114147eb term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,payload=null)ConditionalTokenFilter:
15:11:36 [junit4] 2> org.apache.lucene.analysis.shingle.ShingleFilter(OneTimeWrapper@680fa1a5 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,payload=null, 7)
15:11:36 [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=49B6EB935C8F8B35 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=nl-NL -Dtests.timezone=BST -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
15:11:36 [junit4] ERROR 0.05s J2 | TestRandomChains.testRandomChains <<<
15:11:36 [junit4] > Throwable #1: java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=9: 12 vs 14; token=ߊ ዣዣ
{code}
I'm looking at these now.
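For readers unfamiliar with the "inconsistent endOffset at pos=N: A vs B" message in both failures: it comes from a consistency check over the token graph, which expects every token ending at the same position to report the same end offset (and likewise for start offsets). The sketch below shows that kind of check in plain Java. It is an illustration only, not the Lucene test code: the class `OffsetConsistencySketch`, the `Token` holder, and the `check` method are hypothetical names invented here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an offset-consistency check; not Lucene's implementation.
final class OffsetConsistencySketch {

    // Minimal stand-in for the attributes printed in the logs above.
    static final class Token {
        final int positionIncrement;
        final int positionLength;
        final int startOffset;
        final int endOffset;

        Token(int positionIncrement, int positionLength, int startOffset, int endOffset) {
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Returns null when offsets are consistent, otherwise a description of the first conflict.
    static String check(List<Token> tokens) {
        Map<Integer, Integer> startOffsetAtPos = new HashMap<>();
        Map<Integer, Integer> endOffsetAtPos = new HashMap<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.positionIncrement;          // position where this token starts
            int endPos = pos + t.positionLength; // position where this token ends

            Integer seenStart = startOffsetAtPos.putIfAbsent(pos, t.startOffset);
            if (seenStart != null && seenStart != t.startOffset) {
                return "inconsistent startOffset at pos=" + pos + ": " + seenStart + " vs " + t.startOffset;
            }
            Integer seenEnd = endOffsetAtPos.putIfAbsent(endPos, t.endOffset);
            if (seenEnd != null && seenEnd != t.endOffset) {
                return "inconsistent endOffset at pos=" + endPos + ": " + seenEnd + " vs " + t.endOffset;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two tokens stacked at the same position but ending at different offsets.
        List<Token> tokens = Arrays.asList(
            new Token(1, 1, 0, 5),
            new Token(0, 1, 0, 7));
        System.out.println(check(tokens));
    }
}
```

In this toy input the second token is stacked on the first (position increment 0) but claims a different end offset. That is the general shape of conflict the failures above report, though the sketch says nothing about which filter in the chains is responsible.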
> Add a ConditionalTokenFilter
> ----------------------------
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter in such a way that it could optionally be bypassed based on the current state of the TokenStream. This could be used to, for example, only apply WordDelimiterFilter to terms that contain hyphens.
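The idea in the description can be sketched without any Lucene classes. The example below is a plain-Java model only: it works on a list of strings rather than a TokenStream, and `ConditionalFilterSketch`, `apply`, and `splitOnHyphens` are hypothetical names that do not come from the patch. It shows a wrapped filter being applied only to terms that satisfy a predicate, with all other terms passed through untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical model of "conditionally bypass a filter"; not Lucene's API.
final class ConditionalFilterSketch {

    // Applies `filter` only to terms for which `shouldFilter` is true;
    // every other term bypasses the filter unchanged.
    static List<String> apply(List<String> terms,
                              Predicate<String> shouldFilter,
                              Function<String, List<String>> filter) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (shouldFilter.test(term)) {
                out.addAll(filter.apply(term));
            } else {
                out.add(term);
            }
        }
        return out;
    }

    // Toy stand-in for a word-delimiter style filter: split a term on hyphens.
    static List<String> splitOnHyphens(String term) {
        List<String> parts = new ArrayList<>();
        for (String part : term.split("-")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("wi-fi", "router", "e-mail");
        // Only terms containing a hyphen reach the wrapped filter.
        System.out.println(apply(terms, t -> t.contains("-"), ConditionalFilterSketch::splitOnHyphens));
        // prints [wi, fi, router, e, mail]
    }
}
```

A real token filter also has to keep positions and offsets coherent when some tokens bypass the wrapped filter, which this sketch ignores entirely.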
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)