You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Toke Eskildsen (JIRA)" <ji...@apache.org> on 2010/09/01 09:27:54 UTC
[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904957#action_12904957 ] 

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

{quote}
This doesnt make sense, why do you need the original term also?
{quote}

I was thinking aggregation, but you are right. For aggregation one would of course just use the keys and have no need for the original Strings. Then we're left with federated search.

{quote}
What 'memory overhead'? indexing collation keys, even at tertiary strength (the largest size) is in general less than 2 bytes per character. this is actually less than the cost of a term in ram in lucene 3.1, so i don't understand this?
{quote}

That is the memory overhead. If you have 20M terms of average length 10 chars, that is 400MB in raw bytes and quite a bit more when you're taking pointers into account. When I'm talking memory overhead, my baseline is a newly opened Lucene index without field caching of terms. Having a beefy machine and a small index is trivial. Low-end hardware, virtualized servers and huge indexes all calls for conserving memory. PackedInts helps a lot, avoiding storing the terms (or ICU keys) i RAM helps more.

{quote}
if you are using collation keys, then binary order gives you collated results. So thats what I am hinting at here, is there a more general improvement here you can apply to sorting bytes? If this issue has some ideas that can improve the more general case, I think we should look at factoring those improvements out, and leave the locale stuff as an indexing-time thing.
{quote}

My approach is in theory very simple: Provide an order mapping for each segment, merge the order maps for index level. Having segments with the ICU keys already in sorted order would make the segment maps 1:1 (i.e. they do not require a sort) leaving only the index level merging of order maps. This would halve the build time and together with the 10 times speedup (guessing from the ICU website) that ICU gives, doing the calculation for 20M terms in my example above should take less than a minute. As the caches for the segments are then superfluous, the memory requirements are halved and thus providing faster sort at a memory cost of 100MB as compared to ICU sort speed and 400MB of memory.

{quote}
I don't see this issue really helping if you dont know the locale at index time, by invoking the collator over all the terms at startup you are essentially reindexing in RAM.
{quote}

I fail to see why that is a bad thing if we're looking at the rare scenario of having to postpone the sorting decision to search time. What is the alternative? Right now, search-time collator-based sorting with field cache has low startup time, high memory usage and horrible execution time for large results.

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache which keeps all sort terms in memory. Beside the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contain previous discussions on the subject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org