Posted to issues@lucene.apache.org by "Tomoko Uchida (Jira)" <ji...@apache.org> on 2022/04/09 06:33:00 UTC

[jira] [Comment Edited] (LUCENE-10359) KoreanTokenizer: TestRandomChains fails with incorrect offsets

    [ https://issues.apache.org/jira/browse/LUCENE-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519881#comment-17519881 ] 

Tomoko Uchida edited comment on LUCENE-10359 at 4/9/22 6:32 AM:
----------------------------------------------------------------

I noticed this issue when looking at the KoreanTokenizer. We are aggressively refactoring KoreanTokenizer in [LUCENE-10393] and [LUCENE-10493], and I'd like to enable it in {{TestRandomChains}}, as we do for JapaneseTokenizer. According to the stack trace, the problem is KoreanNumberFilter, not KoreanTokenizer...
To give it a try, can I remove the class-level {{@IgnoreRandomChains}} annotation from it?
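For context, the failure quoted below comes from the test framework's ValidatingTokenFilter, which requires that every token emitted at the same position report the same startOffset. A minimal sketch of that invariant (the class and method names here are illustrative, not Lucene's actual implementation) could look like:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the consistency check that produces errors like
// "inconsistent startOffset at pos=2: 8 vs 11; token=1 zzkuxp".
public class OffsetConsistencyCheck {
    // First startOffset seen for each token position.
    private final Map<Integer, Integer> startOffsets = new HashMap<>();

    // Records the startOffset for a position; throws if a later token at
    // the same position disagrees with an earlier one.
    public void check(int pos, int startOffset, String term) {
        Integer seen = startOffsets.putIfAbsent(pos, startOffset);
        if (seen != null && seen != startOffset) {
            throw new IllegalStateException(
                "inconsistent startOffset at pos=" + pos + ": "
                + seen + " vs " + startOffset + "; token=" + term);
        }
    }
}
```

In the chain below, ShingleFilter emits both a unigram and a shingle at the same position, and a downstream filter that rewrites tokens without keeping offsets aligned trips this check.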


was (Author: tomoko uchida):
I noticed this issue when looking at the KoreanTokenizer. We are aggressively refactoring KoreanTokenizer in [LUCENE-10393] and [LUCENE-10485], and I'd like to enable it in {{TestRandomChains}}, as we do for JapaneseTokenizer. According to the stack trace, the problem is KoreanNumberFilter, not KoreanTokenizer...
To give it a try, can I remove the class-level {{@IgnoreRandomChains}} annotation from it?

> KoreanTokenizer: TestRandomChains fails with incorrect offsets
> --------------------------------------------------------------
>
>                 Key: LUCENE-10359
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10359
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Uwe Schindler
>            Priority: Major
>              Labels: random-chains
>
> It looks like KoreanTokenizer is causing this (NORI), but Kuromoji may be affected in the same way:
> {noformat}
> org.apache.lucene.analysis.tests.TestRandomChains > test suite's output saved to C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\analysis\integration.tests\build\test-results\test\outputs\OUTPUT-org.apache.lucene.analysis.tests.TestRandomChains.txt, copied below:
>   2> stage 0: e<[2-3] +1> ek<[4-6] +1> oy<[8-10] +1> 1<[11-12] +1> zzkuxp<[13-19] +1>
>   2> stage 1: e<[2-3] +1> ek<[4-6] +1> oy<[8-10] +1> 1<[11-12] +1> zzkuxp<[13-19] +1>
>   2> stage 2: e<[2-3] +1> e ek<[2-6] +0> ek<[4-6] +1> ek oy<[4-10] +0> oy<[8-10] +1> oy 1<[8-12] +0> 1<[11-12] +1> 1 zzkuxp<[11-19] +0>
>   2> stage 3: e<[2-3] +1> e ek<[2-6] +0> ek<[4-6] +1> ek oy<[4-10] +0> oy<[8-10] +1> oy 1<[8-12] +0> 1<[11-12] +1> 1 zzkuxp<[11-19] +0>
>   2> last stage: e<[2-3] +1> e ek<[2-6] +0> ek<[4-6] +1> ek oy<[4-10] +0> oy<[8-10] +1> oy 1<[8-12] +0> 1 zzkuxp<[11-19] +0>
>   2> TEST FAIL: useCharFilter=false text='?.e|ek|]oy{1 zzkuxp ZyzzV ycuqjnv axtpppvk \u233b\u23c8\u2314\u232e\u236e\u238d\u235e x d  \"</p>'
>   2> Exception from random analyzer:
>   2> charfilters=
>   2>   org.apache.lucene.analysis.pattern.PatternReplaceCharFilter(a, ifywufhi, java.io.StringReader@48586999)
>   2>   org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@65036838, org.apache.lucene.analysis.pattern.PatternReplaceCharFilter@11d4ba35)
>   2> tokenizer=
>   2>   org.apache.lucene.analysis.ko.KoreanTokenizer()
>   2> filters=
>   2>   org.apache.lucene.analysis.en.KStemFilter(ValidatingTokenFilter@595d7938 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=null,leftPOS=null,rightPOS=null,morphemes=null,reading=null,keyword=false)
>   2>   org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@13d08b48 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=null,leftPOS=null,rightPOS=null,morphemes=null,reading=null,keyword=false, u)
>   2>   org.apache.lucene.analysis.util.ElisionFilter(ValidatingTokenFilter@6396b917 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=null,leftPOS=null,rightPOS=null,morphemes=null,reading=null,keyword=false, [fh, hiiwwxyyd, fcpodqor, qogvhmywr, l, icad])
>   2>   Conditional:org.apache.lucene.analysis.ko.KoreanNumberFilter(OneTimeWrapper@5f0558f6 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=null,leftPOS=null,rightPOS=null,morphemes=null,reading=null,keyword=false)
>    >     java.lang.IllegalStateException: last stage: inconsistent startOffset at pos=2: 8 vs 11; token=1 zzkuxp
>    >         at __randomizedtesting.SeedInfo.seed([E4552C7844FC2DA3:8E0E93691DB20D50]:0)
>    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:138)
>    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:1130)
>    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:1028)
>    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:922)
>    >         at org.apache.lucene.analysis.tests@10.0.0-SNAPSHOT/org.apache.lucene.analysis.tests.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:943)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
