Posted to dev@lucene.apache.org by "Toke Eskildsen (JIRA)" <ji...@apache.org> on 2010/08/31 16:50:54 UTC

[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904642#action_12904642 ] 

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

A status update might be in order. Switching to the current Lucene trunk with flex required a lot of changes. Luckily they all seem to be external to the Lucene core, so this could be a contrib.

The current implementation (no patch yet) seems to scale fairly well. Quick tests were made with a test index with 5 fields, one of which contained random Strings with an average length of 10 characters. The index was not optimized. The goal was to perform a Collator-based sorted search with fillFields=true (the terms used for sorting are returned along with the result) to get the top 20 out of a lot of hits. The search itself was kept cheap by querying a field that was defined with the same term for every other document, so each search hits half of the index. The hardware was a Dell M6500 with i7@1.7GHz, PC1333 RAM, Intel X-25G2 SSD. The tests were performed in the background while coding and ZIPping 6M files, so the timings are rough.

2M document index, search hits 1M documents:
 * Initial exposed search: 1:16 minutes
 * Subsequent exposed searches: 45 ms
 * Total heap usage for Lucene + exposed structure: 21 MB
 * Initial default Lucene search: 3.3 s
 * Subsequent default Lucene searches: 2.1 s
 * Total heap usage for Lucene + field cache: 54 MB

20M document index, search hits 10M documents:
 * Initial exposed search: 15:27 minutes
 * Subsequent exposed searches: 370 ms
 * Total heap usage for Lucene + exposed structure: 209 MB
 * Initial default Lucene search: 28 s
 * Subsequent default Lucene searches: 20 s
 * Total heap usage for Lucene + field cache: 530 MB

200M document index, search hits 100M documents:
 * Initial exposed search: 186:31 minutes
 * Subsequent exposed searches: 3.5 s
 * Total heap usage for Lucene + exposed structure: 2300 MB
 * No data for default Lucene search as there was OOM with 6 GB of heap.

Observations:
 * The memory requirement for the exposed structures is larger than the strict minimum. The extra memory is a cache that supports fast re-opening of indexes (the order of the terms in unchanged segments is reused). Making this cache optional seems like an obvious improvement.
 * The time for startup scales about n * log n with the number of terms. Comparing the 200M to 2M: 186 minutes / (200M * log(200M)) * (2M * log(2M)) ~= 1:25 min (observed was 1:16 min).
 * No tests with 100M documents yet, but about 1½ hours for the build and 1.5GB of RAM would be the expected requirement. Having a startup time of 1 hour+ is of course excessive, but if one made the calculated structures persistent (and thereby reduced restart time to near zero), this would work well with a classic "update once every night" scenario. This would provide Collator-sorted search for a 100M document index on a machine with 2GB of RAM.
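
The n * log n extrapolation above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic, not part of the implementation: the class and method names are made up for illustration, the natural logarithm is used (the base cancels out in the ratio), and the document count is used as a stand-in for the number of terms, as in the observation above.

```java
public class StartupEstimate {

    /**
     * Scales a measured build time to another index size, assuming the
     * work grows as n * log n. The log base cancels out in the ratio.
     */
    static double extrapolateMinutes(double measuredMinutes,
                                     double measuredTerms,
                                     double targetTerms) {
        double measuredWork = measuredTerms * Math.log(measuredTerms);
        double targetWork = targetTerms * Math.log(targetTerms);
        return measuredMinutes / measuredWork * targetWork;
    }

    public static void main(String[] args) {
        // Measured: 186:31 minutes for the 200M document index.
        double measured = 186.0 + 31.0 / 60.0;

        // Document counts stand in for term counts (assumption).
        for (double docs : new double[] {2e6, 20e6, 100e6}) {
            System.out.printf("%.0fM documents: about %.1f minutes%n",
                    docs / 1e6, extrapolateMinutes(measured, 200e6, docs));
        }
    }
}
```

For 2M documents this gives roughly 1.4 minutes, in line with the 1:25 min estimate, and for 100M documents roughly 90 minutes, which is where the 1½ hour figure comes from.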

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps all sort terms in memory. Besides the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.
> The proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contains previous discussions on the subject.
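
The idea in the description can be illustrated with a small, self-contained Java sketch. It is not the code from this issue and uses no Lucene API: the class OrdinalSortSketch and its methods are hypothetical, plain int[] and String[] arrays stand in for the packed structures, and the terms are held in memory instead of being resolved from the index. It only shows the principle: the Collator is used once at startup to order the terms, after which a sorted search compares integers only and the terms are resolved at the end.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PriorityQueue;

public class OrdinalSortSketch {
    private final String[] sortedTerms; // unique terms in Collator order
    private final int[] docToOrd;       // docID -> position in sortedTerms

    /** Expensive startup step: the only place the Collator is used. */
    OrdinalSortSketch(String[] termByDoc, Collator collator) {
        sortedTerms = Arrays.stream(termByDoc)
                .distinct().sorted(collator).toArray(String[]::new);
        Map<String, Integer> ordOf = new HashMap<>();
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            ordOf.put(sortedTerms[ord], ord);
        }
        docToOrd = new int[termByDoc.length];
        for (int doc = 0; doc < termByDoc.length; doc++) {
            docToOrd[doc] = ordOf.get(termByDoc[doc]);
        }
    }

    /** Returns the k first hits in Collator order, comparing ordinals only. */
    int[] topDocs(int[] hits, int k) {
        // Max-heap on ordinal: the head is the worst of the current best k.
        PriorityQueue<Integer> heap = new PriorityQueue<>(
                (a, b) -> Integer.compare(docToOrd[b], docToOrd[a]));
        for (int doc : hits) {
            heap.add(doc);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll();
        }
        return result;
    }

    /** Resolves the sort term after sorting (the fillFields=true case). */
    String term(int docID) {
        return sortedTerms[docToOrd[docID]];
    }

    public static void main(String[] args) {
        String[] termByDoc = {"pear", "Zebra", "apple", "Banana"};
        OrdinalSortSketch sorter =
                new OrdinalSortSketch(termByDoc, Collator.getInstance(Locale.ENGLISH));
        for (int doc : sorter.topDocs(new int[] {0, 1, 2, 3}, 2)) {
            System.out.println(doc + ": " + sorter.term(doc));
        }
    }
}
```

The trade-off matches the numbers in the comment above: building sortedTerms and docToOrd is slow, but each later search only touches one int per hit.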
