Posted to issues@lucene.apache.org by "Dawid Weiss (Jira)" <ji...@apache.org> on 2021/02/08 21:55:00 UTC

[jira] [Commented] (LUCENE-9740) Avoid buffering and double-scan of flags in *.aff file

    [ https://issues.apache.org/jira/browse/LUCENE-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281405#comment-17281405 ] 

Dawid Weiss commented on LUCENE-9740:
-------------------------------------

This is a tentative take on what I think this could look like, Peter ([~Gromov]). I also added some todos and notes - feel free to improve directly on the PR or just fork your own version!

> Avoid buffering and double-scan of flags in *.aff file
> ------------------------------------------------------
>
>                 Key: LUCENE-9740
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9740
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I wrote a small utility test that scans all the *.aff files from openoffice and woorm. No file declares SET or FLAG more than once, and the maximum offsets at which these flags first appear are roughly:
> {code}
> Flag SET at maximum offset 10753
> Flag FLAG at maximum offset 4559
> {code}
> I think we could simply assume that affix files are read with, say, a 20 kB buffered reader, which gives a maximum leading window for scanning for those flags. Dictionary parsing could also fail if either of these flags occurs more than once in the input file?
> This would avoid having to read the file twice and perhaps simplify the API (no need for a temporary spill).
> I'll piggyback this test as part of LUCENE-9727 if you'd like to re-run it locally.
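
For illustration only, the single-pass idea from the description could be sketched roughly as below. This is not the code from the PR: the class name AffixHeaderScan, the method scanHeader and the WINDOW constant are hypothetical, and the 20 kB size is simply the figure suggested above. The sketch marks a buffered stream, peeks at the leading window, records SET and FLAG, and rewinds so that regular parsing can start from offset 0 without a temporary spill. It only detects duplicates inside the window, not in the whole file, and it does not handle a leading byte-order mark.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class AffixHeaderScan {
  // Assumed window size, taken from the 20 kB figure suggested in the issue.
  static final int WINDOW = 20 * 1024;

  /**
   * Peeks at the first WINDOW bytes, collects the SET and FLAG values, and
   * rewinds the stream so that regular parsing can start from offset 0.
   */
  static Map<String, String> scanHeader(BufferedInputStream in) throws IOException {
    in.mark(WINDOW);
    byte[] head = in.readNBytes(WINDOW);
    in.reset();

    // ISO-8859-1 maps bytes to chars one-to-one, which is enough to find the
    // ASCII keywords before the real encoding (the SET value) is known.
    String text = new String(head, StandardCharsets.ISO_8859_1);
    if (head.length == WINDOW) {
      // The window may end mid-line; ignore the trailing partial line.
      text = text.substring(0, text.lastIndexOf('\n') + 1);
    }

    Map<String, String> found = new HashMap<>();
    for (String line : text.split("\n")) {
      String[] parts = line.trim().split("\\s+", 2);
      if (parts.length == 2 && (parts[0].equals("SET") || parts[0].equals("FLAG"))) {
        // Fail on a repeated directive, as proposed in the description.
        if (found.put(parts[0], parts[1].trim()) != null) {
          throw new IOException("Duplicate " + parts[0] + " directive in affix file");
        }
      }
    }
    return found;
  }

  public static void main(String[] args) throws IOException {
    byte[] aff = "# comment\nSET UTF-8\nFLAG long\nSFX A Y 1\n"
        .getBytes(StandardCharsets.US_ASCII);
    BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(aff));
    System.out.println(scanHeader(in));
    // The stream has been rewound, so the next byte read is the leading '#'.
    System.out.println((char) in.read());
  }
}
```

Reading at most WINDOW bytes after mark(WINDOW) stays within the mark's read limit, so reset() is expected to succeed. A file whose SET or FLAG line sits beyond the window would simply not be detected by this sketch.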


