
Posted to issues@lucene.apache.org by "Michael McCandless (Jira)" <ji...@apache.org> on 2020/03/03 17:35:00 UTC

[jira] [Commented] (LUCENE-9191) Fix linefiledocs compression or replace in tests

    [ https://issues.apache.org/jira/browse/LUCENE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050413#comment-17050413 ] 

Michael McCandless commented on LUCENE-9191:
--------------------------------------------

I plan to commit this soon. It should make tests that use {{LineFileDocs}} more efficient, since the up-front random seeking is much cheaper now.

These docs are derived from the [Europarl parallel corpus v7|https://www.statmt.org/europarl/], randomly split into smallish documents (one per line), and then broken into source files of 20 MB, 200 MB and 2000 MB (sizes before compression).  I'll commit the 20 MB file here, along with the Python script that creates the random files.

I also copied all the files up to {{home.apache.org}}: [200 MB|http://home.apache.org/~mikemccand/200mb.txt.gz] (and its [.seek file|http://home.apache.org/~mikemccand/200mb.txt.seek]), and [2000 MB|http://home.apache.org/~mikemccand/2000mb.txt.gz] (and its [.seek file|http://home.apache.org/~mikemccand/2000mb.txt.seek]), in case developers want to test on a wider set of random docs :)
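
As an illustration of how a seek index can make random access into a gzip file cheap, here is a minimal Python sketch. It is not the script from the patch, and the real {{.seek}} file format is not described in this message. The sketch assumes the lines are compressed in small independent gzip members and that the byte offset of each member is recorded; the names {{write_blocks}} and {{read_from}} are hypothetical.

```python
import gzip
import io


def write_blocks(lines, block_size=3):
    """Compress lines as concatenated, independent gzip members.

    Returns the compressed bytes and the byte offset at which each
    member starts (a stand-in for a seek index; the format is assumed).
    """
    out = io.BytesIO()
    offsets = []
    for start in range(0, len(lines), block_size):
        offsets.append(out.tell())
        block = "".join(line + "\n" for line in lines[start:start + block_size])
        out.write(gzip.compress(block.encode("utf-8")))
    return out.getvalue(), offsets


def read_from(data, offset):
    """Jump to a member boundary and decode every line from there on."""
    raw = io.BytesIO(data)
    raw.seek(offset)
    with gzip.GzipFile(fileobj=raw, mode="rb") as gz:
        return gz.read().decode("utf-8").splitlines()


lines = ["doc%d" % i for i in range(10)]
data, offsets = write_blocks(lines)
# Start reading at the third member without decompressing the first two.
tail = read_from(data, offsets[2])
```

Picking a random entry from the offsets and seeking straight to it avoids decompressing everything before the chosen starting point, which is the cost the issue description complains about.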

> Fix linefiledocs compression or replace in tests
> ------------------------------------------------
>
>                 Key: LUCENE-9191
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9191
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-9191.patch, LUCENE-9191.patch
>
>
> LineFileDocs(random) is very slow, even to open. It does a very slow "random skip" through a gzip compressed file.
> For the analyzers tests, in LUCENE-9186 I simply removed its usage, since TestUtil.randomAnalysisString is superior, and fast. But we should address other tests using it, since LineFileDocs(random) is slow!
> I think it is also the case that every Lucene test has probably tested every LineFileDocs line many times now, whereas randomAnalysisString will invent new ones.
> Alternatively, we could "fix" LineFileDocs(random), e.g. special compression options (in blocks)... deflate supports such stuff. But it would make it even hairier than it is now.
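
The "compression in blocks" idea in the description can be sketched with zlib's full-flush feature. This is only an illustration under stated assumptions, not what the issue ended up implementing: it uses a raw deflate stream and calls {{Z_FULL_FLUSH}} after each block, which resets the compressor state so a decompressor can start at that byte offset. The function names are hypothetical.

```python
import zlib


def compress_with_sync_points(blocks):
    """Raw-deflate the blocks, full-flushing after each one.

    Returns the compressed bytes and the offset where each block begins.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw deflate
    out = bytearray()
    offsets = []
    for block in blocks:
        offsets.append(len(out))
        out += comp.compress(block)
        # Full flush: byte-align the output and reset the dictionary.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush()  # finish the stream
    return bytes(out), offsets


def decompress_from(data, offset):
    """Start inflating at a recorded full-flush boundary."""
    return zlib.decompressobj(-15).decompress(data[offset:])


blocks = [b"alpha\n", b"beta\n", b"gamma\n"]
data, offsets = compress_with_sync_points(blocks)
rest = decompress_from(data, offsets[1])
```

Each full flush costs some compression ratio, so the block size is a trade-off between file size and how finely a reader can seek.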



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
