Posted to dev@lucene.apache.org by "Toke Eskildsen (JIRA)" <ji...@apache.org> on 2010/09/02 17:14:59 UTC

[jira] Issue Comment Edited: (LUCENE-2369) Locale-based sort by field with low memory overhead

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905507#action_12905507 ] 

Toke Eskildsen edited comment on LUCENE-2369 at 9/2/10 11:14 AM:
-----------------------------------------------------------------

{quote}
They are just regular terms! you can do a TermQuery on them, sort them as byte[], etc.
its just the bytes use 'collation encoding' instead of 'utf-8 encoding'.
{quote}

Yes, you have stated that repeatedly.

As you are unwilling to answer my questions, I've tweaked my tests to see for myself. That worked very well, as I learned something new. To answer one of my questions: yes, the terms are loaded into memory when a sort is performed on a field. This is the case for the locale-based sort (yes, I have understood that you do not consider that a viable form of sorting; I only mention it for completeness), for STRING sorting (natural order with ordinal-based speedup) and for STRING_VAL sorting (natural order without the ordinal speedup). All this against Lucene trunk, no patches.
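To make the ordinal-based speedup of STRING sorting concrete, here is a minimal, hypothetical sketch of the trick (plain Java, not Lucene's actual FieldCache code): terms are sorted once, each document keeps only an int ordinal, and doc-vs-doc comparison during collection becomes a cheap int compare instead of a per-hit term comparison.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of ordinal-based sorting, the idea behind
// SortField.STRING's speedup over STRING_VAL: sort the unique terms
// once, then compare documents by their int ordinals.
public class OrdinalSortSketch {
    // docTerms[doc] is the sort term of document doc.
    public static int[] buildOrdinals(String[] docTerms) {
        // Sorted set of unique terms; a term's position is its ordinal.
        TreeSet<String> unique = new TreeSet<>(Arrays.asList(docTerms));
        Map<String, Integer> ordOf = new HashMap<>();
        int ord = 0;
        for (String term : unique) {
            ordOf.put(term, ord++);
        }
        int[] ords = new int[docTerms.length];
        for (int doc = 0; doc < docTerms.length; doc++) {
            ords[doc] = ordOf.get(docTerms[doc]);
        }
        return ords;
    }

    // Comparing two documents is now an int comparison; no String or
    // byte[] comparison is needed per hit.
    public static int compareDocs(int[] ords, int docA, int docB) {
        return Integer.compare(ords[docA], ords[docB]);
    }
}
```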

{quote}
This is why i want to factor out the whole 'locale' thing from the issue, since sorting is agnostic to whats in the byte[], its unrelated and it would simplify the issue to just discuss that.
{quote}

For my new tests I've switched to natural-order sorting (direct byte[] comparison, i.e. a Lucene STRING sorted search without specifying a Locale). The tests should be fairly telling for the different scenarios, as the ICU collation keys should be about the same size as the original terms.
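For clarity, "natural order" here means plain unsigned lexicographic comparison of the raw bytes, with no Collator involved. A minimal sketch of that comparison (this mirrors what a byte[]-based STRING sort compares; it is not Lucene's actual BytesRef code):

```java
// Sketch of natural-order comparison: unsigned lexicographic
// comparison of raw bytes, the order a STRING sort without a
// Locale uses.
public class NaturalOrder {
    public static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            // Mask with 0xFF so bytes compare as unsigned values.
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // Equal prefix: the shorter array sorts first.
        return a.length - b.length;
    }
}
```

This is also why collation has to happen at encoding time if locale-correct order is wanted: once the bytes are collation keys, this dumb byte comparison yields the locale's order.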

{quote}
    * The apis here are wrong anyway: it shouldnt take Locale but Collator.
      There is no way to set strength or any other options, and theres no way to supply a Collator i made myself (e.g. from RuleBasedCollator)
{quote}

I fully agree. The code I've made so far takes a Comparator<BytesRef>, with an optimization for wrapped Collators. The title of LUCENE-2369 is not technically correct, but it was chosen because "Locale" is a fairly well-known concept while "Collator" is more complex. That might have been a mistake.
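A hypothetical sketch of what such a Comparator-based API looks like (byte[] stands in for BytesRef here; this is not the actual patch code): the sort machinery only ever sees a Comparator over raw term bytes, and a Collator is just one possible wrapped implementation. Strength and other options are configured on the Collator itself, so the sort API needs no Locale parameter and accepts a RuleBasedCollator just as readily.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

// Hypothetical sketch: the sort code depends only on a
// Comparator over term bytes; a Collator is one possible backing.
public class CollatorWrapping {
    // Wrap any Collator (including a RuleBasedCollator) as a
    // byte-level comparator. Assumes the stored bytes are UTF-8.
    public static Comparator<byte[]> wrap(final Collator collator) {
        return (a, b) -> collator.compare(
                new String(a, StandardCharsets.UTF_8),
                new String(b, StandardCharsets.UTF_8));
    }

    // Convenience: locale plus explicit strength, which the
    // Locale-only API criticized above cannot express.
    public static Comparator<byte[]> forLocale(Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength); // e.g. Collator.PRIMARY ignores case
        return wrap(collator);
    }
}
```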

Onwards to testing with natural order (new Sort(new SortField(myField, SortField.STRING))) for Lucene and the hybrid approach (natural order + pre-sorting) for exposed. No ZIPping in the background this time, so measurements differ from the previous test. Heap sizes were measured after a call to System.gc().

2M document index, search hits 1M documents, top 10 hits extracted:

    * Initial exposed search: 0:16 minutes
    * Subsequent exposed searches: 40 ms
    * Total heap usage for Lucene + exposed structure: 20 MB
    * Initial default Lucene search: 0.8 s
    * Subsequent default Lucene searches: 25 ms
    * Total heap usage for Lucene + field cache: 60 MB

20M document index, search hits 10M documents, top 10 hits extracted:

    * Initial exposed search: 2:53 minutes
    * Subsequent exposed searches: 330 ms
    * Total heap usage for Lucene + exposed structure: 154 MB
    * Initial default Lucene search: 6 s
    * Subsequent default Lucene searches: 220 ms
    * Total heap usage for Lucene + field cache: 600 MB

200M document index, search hits 100M documents, top 10 hits extracted:

    * Initial exposed search: 31:33 minutes
    * Subsequent exposed searches: 3200 ms
    * Total heap usage for Lucene + exposed structure: 1660 MB
    * No data for default Lucene search as there was OOM with 7 GB of heap.

What did we learn from this?

   * Natural order search in Lucene with STRING is very fast (as most of the work is ordinal comparison).
   * Exposed sorting is actually slower than natural order (that's news to me). The culprit is a modified PackedInts structure. I'll look into that.
   * The exposed structure build penalty for the hybrid approach (i.e. relying on natural order instead of doing an explicit sort) was indeed markedly lower than exposed with explicit sorting. A factor of 5. I would have expected it to be more, though.
   * The hybrid approach uses less than a third of the amount of RAM required by Lucene natural order sorting.
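To show where the RAM savings in the last point come from, here is a minimal bit-packing sketch in the spirit of PackedInts (this is my own illustration, not Lucene's PackedInts nor the modified structure mentioned above): with U unique terms, each doc-to-ordinal entry needs only ceil(log2(U)) bits, instead of a full term object per document as in a term-based field cache.

```java
// Sketch of the PackedInts idea: store doc->ordinal values in
// ceil(log2(maxOrdinal + 1)) bits each, packed into longs.
public class PackedOrdinals {
    private final long[] blocks;
    private final int bitsPerValue;

    public PackedOrdinals(int[] ordinals, int maxOrdinal) {
        // Bits needed to represent maxOrdinal (at least 1).
        this.bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(maxOrdinal));
        this.blocks = new long[(int) (((long) ordinals.length * bitsPerValue + 63) / 64)];
        for (int i = 0; i < ordinals.length; i++) {
            set(i, ordinals[i]);
        }
    }

    private void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= value << shift;
        if (shift + bitsPerValue > 64) {
            // The value straddles two longs; spill the high bits.
            blocks[block + 1] |= value >>> (64 - shift);
        }
    }

    public long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return value & mask;
    }
}
```

For 200M documents and, say, 20M unique terms (25 bits per value), this is roughly 600 MB of packed longs regardless of term length, which is the kind of arithmetic behind the heap numbers above.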

So, Robert, does this answer your challenge "if you have a way to improve memory usage for byte[] terms, lets look just at that?"?

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>         Attachments: LUCENE-2369.patch
>
>
> The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps all sort terms in memory. Besides the huge memory overhead, searching requires comparing terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335, which contains previous discussions on the subject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org