Posted to issues@lucene.apache.org by "Dawid Weiss (Jira)" <ji...@apache.org> on 2022/05/16 07:35:00 UTC

[jira] [Resolved] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

     [ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss resolved LUCENE-10541.
----------------------------------
    Fix Version/s: 9.2
       Resolution: Fixed

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10541
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10541
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 9.2
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root-caused: [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of the nightly (many GBs) Wikipedia LineFileDocs file that contained a massive term (larger than IndexWriter's ~32 KB limit), so IndexWriter threw an {{IllegalArgumentException}} and the test failed.
> It's crazy that it took so long for Lucene's randomized tests to discover this too-massive term in Lucene's nightly benchmarks.  It's like searching for Nessie, or [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures somehow, and there are multiple options: fix this test to not use {{LineFileDocs}}, remove all "massive" terms from every {{LineFileDocs}} file used by tests (nightly and git), fix {{MockTokenizer}} to trim such ridiculous terms (I think this is the best option?), ...
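
The "trim such ridiculous terms" option can be illustrated with a minimal sketch. This is not the change that was committed for this issue and it does not use any Lucene classes; the class name TermTrimSketch and the method trimToUtf8Bytes are hypothetical names chosen for illustration. The one Lucene-specific assumption is the constant: 32766 is the value of IndexWriter.MAX_TERM_LENGTH, the limit on a term's UTF-8 encoded length. The sketch keeps the longest prefix of a term that fits in that many bytes, without cutting a code point in half.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Lucene.
final class TermTrimSketch {
    // Assumed to mirror IndexWriter.MAX_TERM_LENGTH (limit on a term's UTF-8 length).
    static final int MAX_TERM_BYTES = 32766;

    /** Returns the longest prefix of term whose UTF-8 encoding fits in maxBytes. */
    static String trimToUtf8Bytes(String term, int maxBytes) {
        int bytes = 0; // UTF-8 bytes consumed by the prefix so far
        int i = 0;     // index into term, in UTF-16 code units
        while (i < term.length()) {
            int cp = term.codePointAt(i);
            // UTF-8 length of this code point; an unpaired surrogate is counted
            // as 3 bytes, which can only overestimate, so the result stays safe.
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break; // the next code point would push the term past the limit
            }
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return term.substring(0, i);
    }

    public static void main(String[] args) {
        String huge = "a".repeat(40_000); // stand-in for a massive Wikipedia term
        String trimmed = trimToUtf8Bytes(huge, MAX_TERM_BYTES);
        System.out.println(trimmed.getBytes(StandardCharsets.UTF_8).length); // prints 32766
    }
}
```

Counting bytes rather than chars matters here because the limit applies to the encoded term: a term made of non-ASCII characters reaches the limit with far fewer chars than an ASCII one. A test tokenizer would more likely cap tokens at a much smaller length, but the same prefix logic applies.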



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
