Posted to issues@lucene.apache.org by "Dawid Weiss (Jira)" <ji...@apache.org> on 2021/02/07 19:19:00 UTC

[jira] [Created] (LUCENE-9740) Avoid buffering and double-scan of flags in *.aff file

Dawid Weiss created LUCENE-9740:
-----------------------------------

             Summary: Avoid buffering and double-scan of flags in *.aff file
                 Key: LUCENE-9740
                 URL: https://issues.apache.org/jira/browse/LUCENE-9740
             Project: Lucene - Core
          Issue Type: Sub-task
            Reporter: Dawid Weiss
            Assignee: Dawid Weiss


I wrote a small utility test that scans all the *.aff files from OpenOffice and wooorm. No file declares SET or FLAG more than once, and the maximum leading offsets at which these flags appear are roughly:
{code}
Flag SET at maximum offset 10753
Flag FLAG at maximum offset 4559
{code}
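
For anyone who wants to reproduce numbers like these, here is a rough sketch of how such offsets could be measured. It is not the actual utility test: the class and method names (DirectiveOffsets, firstOffset, maxOffset) are made up for illustration, and it assumes that "offset" means the byte position of the first line that begins with the directive followed by a space.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene.
final class DirectiveOffsets {

  /** Byte offset of the first line starting with "<directive> ", or -1 if absent. */
  static int firstOffset(byte[] content, String directive) {
    byte[] key = (directive + " ").getBytes(StandardCharsets.US_ASCII);
    // Skip a UTF-8 byte order mark so a directive on the first line is still seen.
    int lineStart = hasUtf8Bom(content) ? 3 : 0;
    while (lineStart < content.length) {
      if (matchesAt(content, lineStart, key)) {
        return lineStart;
      }
      // Advance to the start of the next line.
      while (lineStart < content.length && content[lineStart] != '\n') {
        lineStart++;
      }
      lineStart++;
    }
    return -1;
  }

  private static boolean hasUtf8Bom(byte[] c) {
    return c.length >= 3
        && (c[0] & 0xFF) == 0xEF
        && (c[1] & 0xFF) == 0xBB
        && (c[2] & 0xFF) == 0xBF;
  }

  private static boolean matchesAt(byte[] content, int pos, byte[] key) {
    if (pos + key.length > content.length) {
      return false;
    }
    for (int i = 0; i < key.length; i++) {
      if (content[pos + i] != key[i]) {
        return false;
      }
    }
    return true;
  }

  /** Largest first-occurrence offset over all *.aff files under root, or -1 if none match. */
  static int maxOffset(Path root, String directive) throws IOException {
    try (Stream<Path> files = Files.walk(root)) {
      return files
          .filter(p -> p.toString().endsWith(".aff"))
          .mapToInt(p -> {
            try {
              return firstOffset(Files.readAllBytes(p), directive);
            } catch (IOException e) {
              throw new UncheckedIOException(e);
            }
          })
          .max()
          .orElse(-1);
    }
  }
}
```

Calling maxOffset(dir, "SET") and maxOffset(dir, "FLAG") on a local checkout of the dictionaries would give figures comparable to the ones above, under the stated assumptions.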

I think we could just make an assumption that, say, affix files are read with a 20 kB buffered reader, which provides a maximum leading window for scanning for those flags. Dictionary parsing could also fail if either of these flags occurs more than once in the input file?

This would avoid having to read the file twice and perhaps simplify the API (no need for a temporary spill).
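
To make the idea concrete, here is a minimal sketch of what a single-pass header scan might look like. It is only an illustration, not a proposed patch: AffixHeaderScan, scanHeader and MAX_PROLOGUE_BYTES are hypothetical names, and the 20 kB limit is the figure suggested above. It marks the stream, reads at most one window, looks for SET and FLAG, fails on a duplicate inside that window, and rewinds so the regular parser can start from the beginning. The window is decoded as ISO-8859-1 only because the directive keywords are ASCII and the real encoding is not known until SET has been read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class AffixHeaderScan {
  // Assumed leading window; comfortably above the largest offset observed.
  static final int MAX_PROLOGUE_BYTES = 20 * 1024;

  /** Returns the SET and FLAG values found in the leading window and rewinds the stream. */
  static Map<String, String> scanHeader(InputStream in) throws IOException {
    if (!in.markSupported()) {
      throw new IllegalArgumentException("Stream must support mark/reset");
    }
    in.mark(MAX_PROLOGUE_BYTES);
    byte[] window = in.readNBytes(MAX_PROLOGUE_BYTES);
    in.reset(); // the full parse starts again from offset 0, no temporary spill

    String text = new String(window, StandardCharsets.ISO_8859_1);
    // A UTF-8 byte order mark shows up as these three characters in ISO-8859-1.
    if (text.startsWith("\u00EF\u00BB\u00BF")) {
      text = text.substring(3);
    }
    // If the window was filled completely, its last line may be cut off; ignore it.
    if (window.length == MAX_PROLOGUE_BYTES) {
      int lastNewline = text.lastIndexOf('\n');
      text = lastNewline < 0 ? "" : text.substring(0, lastNewline);
    }

    Map<String, String> found = new HashMap<>();
    for (String line : text.split("\\r?\\n")) {
      String[] parts = line.trim().split("\\s+", 2);
      boolean isDirective =
          parts.length == 2 && (parts[0].equals("SET") || parts[0].equals("FLAG"));
      if (isDirective && found.putIfAbsent(parts[0], parts[1].trim()) != null) {
        throw new IllegalStateException("Duplicate " + parts[0] + " directive");
      }
    }
    return found;
  }
}
```

The input would be wrapped in a BufferedInputStream (or anything else supporting mark/reset) before the call. This sketch only detects duplicates inside the window; rejecting a SET or FLAG that shows up later in the file would still have to happen during the main parse.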

I'll piggyback this test as part of LUCENE-9727 if you'd like to re-run it locally.


