You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Toke Eskildsen (JIRA)" <ji...@apache.org> on 2010/04/06 11:53:33 UTC

[jira] Created: (LUCENE-2369) Locale-based sort by field with low memory overhead

Locale-based sort by field with low memory overhead
---------------------------------------------------

Key: LUCENE-2369
URL: https://issues.apache.org/jira/browse/LUCENE-2369
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Reporter: Toke Eskildsen
Priority: Minor

The current implementation of locale-based sort in Lucene uses the FieldCache which keeps all sort terms in memory. Beside the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.

This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.

This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contain previous discussions on the subject.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853886#action_12853886 ] 

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

Earwin, it would be great if this can be done without modifying IndexReaders at all. I know that the getExposedTuples can be refactored out - it'll be somewhat clunky with checks for the class for a given IndexReader and different handling of different readers though. But, however I tweak it, I still need to be able to access a given term by its ordinal from outside the reader. Last time I looked, this was not possible. Has this changed since then?

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache which keeps all sort terms in memory. Beside the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contain previous discussions on the subject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853883#action_12853883 ] 

Earwin Burrfoot commented on LUCENE-2369:
-----------------------------------------

I would like to note that new types of sort caches do not require invasive surgery on SegmentReader, of the type seen in previous issue.
Or did I miss some peculiar requirement current APIs do not satisfy?

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache which keeps all sort terms in memory. Beside the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contain previous discussions on the subject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853884#action_12853884 ] 

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

The current implementation accepts Comparator<Object> (which must accept Strings) as well as a Locale (which is converted to Collator.getInstance(locale) under the hoo)d as arguments. Plugging in the ICU collator directly should be trivial. If/when it gets possible to use byte[] for sorters in general, I'll add support for that.

Indexing ICU collator keys and using them in combination with LUCENE-2369 is an interesting idea, as it would speed up the building process quite a lot, while keeping the memory usage down. As long as fillFields=false, the two methods are independent as should work well with each other. Fairly easy to try.

For fillFields=true, it gets a bit trickier and requires a special FieldComparatorSource that keeps two maps from docID: One to the ICU collator key, one to the original term. Still, it should not be that hard to implement and I'll be happy to do it if the fillFields=false-case turns out to work well.

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache which keeps all sort terms in memory. Beside the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contain previous discussions on the subject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853885#action_12853885 ] 

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

Moved from LUCENE-2335 as it really belongs here.

Lotsa devils in the details when you're poking around in the belly of Lucene, but modulo some business with deleted documents, it looks fine for simple (no parallel and multi readers) usage. fillFields=true works just as it should by delaying the actual term resolving until the documents ID's are determined. The current code makes it possible to create an exposed Sort quite easily:
{code}
      ExposedFieldComparatorSource exposedFCS =
          new ExposedFieldComparatorSource(reader, new Locale("da"));
      Sort sort = new Sort(new SortField("mySortField", exposedFCS));
{code}

For the curious, a modified Lucene-JAR can be downloaded at http://github.com/tokee/lucene/downloads and tested with
{code}
java -cp lucene-core-3.1-dev-LUCENE-2335-20100405.jar org.apache.lucene.index.ExposedPOC expose <index> <sortField> <locale> <defaultField>
{code}
this will present the user with a simple shell where searches can be performed. Heap-usage and execution times are displayed along with the search result.

I did a little bit of real world experimenting: A 2.5GB index with 400K documents with 320K unique sort terms took 14 seconds to open. After that, a locale-based sorted search that hit 250K documents and returned the first 50 took 30ms (fully warmed by re-searching 5 times). Max heap was specified to 40MB of which 20MB was used after the building of the sort structure was finished.

The same search using the standard locale-oriented sorter took about 1 second to start up, After that, the 250K search took 130ms, fully warmed. Max heap was specified to 100MB.

The default sorter was able to get by with 80MB, but execution-time increased drastically to 2000ms. Probably because of the GC-overhead that the Collator introduces by temporarily allocating two new objects for each comparison.


The bad news is that this is quite a bit of code (400+ extra lines for the SegmentReader alone) with several levels of indirection in the data structures. As an example, getting the actual term for a given docID in the ExposedFieldComparatorSource is done with
{code}
        final long resolvedDocOrder = docOrder.get(order[slot]);
        return resolvedDocOrder == undefinedTerm ? null : reader.getTermText(
            (int)termOrder.get((int)resolvedDocOrder));
{code}
which is not easily digested without a very thorough explanation, preferably with a diagram.

The API-changes to the IndexReaders is the addition of two methods:
{code}
String getTermText(int ordinal) throws IOException;
{code}
is self-explanatory, but 
{code}
ExposedIterator getExposedTuples(
      String persistenceKey, Comparator<Object> comparator, String field,
      boolean collectDocIDs) throws IOException;
{code}
is real hard to do when writing a new IndexReader. My current approach is to use an interface ExposedReader with the methods and let the updated IndexReaders implement that, thereby making it optional for IndexReaders to be Exposed. 

LUCENE-2335 seems to fulfill the promises so far, with the previously discussed trade-offs:
* Long startup (~1 min/1M documents, less on re-open)
* Fast locale-based sorted search (supports fillFields=true and has near-zero GC overhead)
* Very low memory overhead (both permanent and for initializing)

Regarding making a proper patch, I would like to know what I should patch against. I use LUCENE-1990 so I need to do it against a fairly new version. I can see that Flex is about to be merged, so I guess it would make sense to wait for that one.

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache which keeps all sort terms in memory. Beside the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contain previous discussions on the subject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854069#action_12854069 ] 

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

A few experiments with the current implementation:

40GB index, 5 segments, 7.5M documents, 5.5M unique sort terms, 87M terms total, sort locale da, top-20 displayed with fillFields=true. Just opening the index without any Sort requires 140MB.
Standard sorter: -Xmx1800m, 26 seconds for first search
Exposed sorter: -Xmx350m, 7 minutes for first search (~4½ minutes for segment sorting, ~2½ minutes for merging).

Fully warmed searches, approximate mean:
6.5M hits: standard 2500 ms, exposed 240 ms
4.1M hits: standard 1600 ms, exposed 190 ms
2.1M hits: standard 900 ms, exposed 90 ms
1.2M hits: standard 500 ms, exposed 45 ms
0.5M hits: standard 220 ms, exposed 40 ms
0.1M hits: standard 80 ms, exposed 6 ms
1.7K hits: standard 3 ms, exposed <1 ms

 
2.5GB index, 4 segments, 420K documents, 240K unique sort terms, 11M terms total, sort locale da, top-20 displayed with fillFields=true. Just opening the index without any Sort requires 18MB.
Standard sorter: -Xmx120m, 2 seconds for first search
Exposed sorter: -Xmx50m, 14 seconds for first search (9 seconds for segment sorting, 5 seconds for merging).

Fully warmed searches, approximate mean:
420K hits: standard 170 ms, exposed 15 ms
200K hits: standard 85 ms, exposed 9 ms
100K hits: standard 50 ms, exposed 8 ms
 10K hits: standard 6 ms, exposed 0-1 ms

As can be seen, the timings are fairly consistent for this small ad-hoc test. The difference between standard and exposed sorting is the time it takes for the collator to perform compares. I'll have to test if that can be improved by using a plain int-array to hold the order of the documents, just as the non-locale-using String sorter does.

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache which keeps all sort terms in memory. Beside the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contain previous discussions on the subject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853838#action_12853838 ] 

Robert Muir commented on LUCENE-2369:
-------------------------------------

Toke, I still think it would be better to use ICU collation keys here.

for your danish text, the memory usage will still be smaller than trunk (as ICU collation keys as byte[] are smaller than java's internal utf-16 encoding).
then you can do this sort at index-time...

one step towards this is to switch collation in flex to use byte[] rather than encoding in char[] like it does today.


> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache which keeps all sort terms in memory. Beside the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of pre-sorted ordinals for the sort terms and a map from document-IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 which contain previous discussions on the subject.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org