Posted to dev@lucene.apache.org by "Martijn van Groningen (JIRA)" <ji...@apache.org> on 2015/05/22 15:17:17 UTC

[jira] [Updated] (LUCENE-6496) Updatable OrdinalMap

     [ https://issues.apache.org/jira/browse/LUCENE-6496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-6496:
------------------------------------------
    Attachment: LUCENE-6496.patch

Attached an initial patch:
1. Added a common interface for OrdinalMap.
2. Pulled the code from MultiDocValues.OrdinalMap into a concrete implementation called ImmutableOrdinalMap. I haven't removed MultiDocValues.OrdinalMap yet, in order to keep this patch small (otherwise the code that uses it would also need to be modified).
3. Added an UpdatableOrdinalMap implementation that wraps an ImmutableOrdinalMap but keeps track of changes. It holds the segment core keys for all segments, and for new segments (created from the second ordinal build) it holds the segment ordinal to global ordinal lookup.
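
To make the layering in points 1-3 concrete, here is a minimal, self-contained sketch. It is not the code from the patch: the names OrdinalLookup, ImmutableLookup, UpdatableLookup, addSegment and knows are hypothetical, plain long arrays stand in for Lucene's packed structures, and core keys are modelled as opaque objects.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical common interface; the method name is an assumption. */
interface OrdinalLookup {
    /** Maps a segment-local ordinal to a global ordinal. */
    long getGlobalOrd(int segmentIndex, long segmentOrd);
}

/** Stand-in for the immutable map built once over a fixed set of segments. */
final class ImmutableLookup implements OrdinalLookup {
    private final long[][] segmentToGlobal;

    ImmutableLookup(long[][] segmentToGlobal) {
        this.segmentToGlobal = segmentToGlobal;
    }

    int segmentCount() {
        return segmentToGlobal.length;
    }

    @Override
    public long getGlobalOrd(int segmentIndex, long segmentOrd) {
        return segmentToGlobal[segmentIndex][(int) segmentOrd];
    }
}

/** Wraps the immutable map and records lookups for segments added later. */
final class UpdatableLookup implements OrdinalLookup {
    private final ImmutableLookup base;
    // Core keys of every segment this map knows about (base and added).
    private final Set<Object> coreKeys = new HashSet<>();
    // Segment ordinal to global ordinal tables for segments added after the base build.
    private final List<long[]> addedSegments = new ArrayList<>();

    UpdatableLookup(ImmutableLookup base, Collection<?> baseCoreKeys) {
        this.base = base;
        this.coreKeys.addAll(baseCoreKeys);
    }

    /** Registers a new segment whose terms all already have global ordinals. */
    int addSegment(Object coreKey, long[] segmentToGlobal) {
        coreKeys.add(coreKey);
        addedSegments.add(segmentToGlobal.clone());
        return base.segmentCount() + addedSegments.size() - 1;
    }

    boolean knows(Object coreKey) {
        return coreKeys.contains(coreKey);
    }

    @Override
    public long getGlobalOrd(int segmentIndex, long segmentOrd) {
        if (segmentIndex < base.segmentCount()) {
            return base.getGlobalOrd(segmentIndex, segmentOrd);
        }
        return addedSegments.get(segmentIndex - base.segmentCount())[(int) segmentOrd];
    }
}
```

In this sketch the base segments are answered by the wrapped immutable map, and only segments added afterwards consult the extra tables, so lookups for the original segments stay as cheap as before.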

As it stands, the UpdatableOrdinalMap rebuilds if:
* A new term has been introduced.
* A segment that was previously known has disappeared.
* A new segment contains more than 128 unique values, or the ratio of segment values to index values is higher than 0.1.
* In total, 20 or more segments are going to be reopened.

These heuristics and defaults still need to be verified and benchmarked.
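
The rebuild conditions above can be sketched as a single predicate. This is only an illustration of one reading of the list, not the logic from the patch: the class and parameter names are hypothetical, the "ratio" is assumed to mean a new segment's unique value count divided by the index's unique value count, and "segments going to be reopened" is assumed to mean the number of new segments seen on reopen.

```java
/** Hypothetical helper; thresholds are the unverified defaults quoted above. */
final class RebuildPolicy {
    static final long MAX_NEW_SEGMENT_VALUES = 128;
    static final double MAX_SEGMENT_TO_INDEX_RATIO = 0.1;
    static final int MAX_REOPENED_SEGMENTS = 20;

    private RebuildPolicy() {}

    /**
     * @param newTermSeen           a term without a global ordinal was found
     * @param knownSegmentGone      a previously known segment has disappeared
     * @param newSegmentValueCounts unique value count of each new segment
     * @param indexValueCount       unique value count of the whole index
     */
    static boolean needsRebuild(boolean newTermSeen,
                                boolean knownSegmentGone,
                                long[] newSegmentValueCounts,
                                long indexValueCount) {
        if (newTermSeen || knownSegmentGone) {
            return true;
        }
        // Assumption: the "20 or more segments" rule counts new segments.
        if (newSegmentValueCounts.length >= MAX_REOPENED_SEGMENTS) {
            return true;
        }
        for (long count : newSegmentValueCounts) {
            // Treat an empty index as ratio 1.0 so the division is always safe.
            double ratio = indexValueCount == 0 ? 1.0 : (double) count / indexValueCount;
            if (count > MAX_NEW_SEGMENT_VALUES || ratio > MAX_SEGMENT_TO_INDEX_RATIO) {
                return true;
            }
        }
        return false;
    }
}
```

For example, under these assumptions a single new segment with 50 unique values against an index of 1,000,000 unique values would be absorbed without a rebuild, while one with 200 unique values would trigger it.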

I still need to test the performance of the UpdatableOrdinalMap. There is a cost in looking up whether a term already has a global ordinal, and I need to figure out when it is okay to pay for this and when it is simply better to rebuild completely.

> Updatable OrdinalMap 
> ---------------------
>
>                 Key: LUCENE-6496
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6496
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE-6496.patch
>
>
> The MultiDocValues.OrdinalMap that we have today requires a rebuild on each reopen. Once the OrdinalMap has been built, lookups are fast and the logic is simple. Often rebuilding the OrdinalMap isn't even an issue, because for low to medium cardinality fields the rebuild doesn't take much time. The time required to build the OrdinalMap depends on the number of unique terms in a field.
> For high cardinality fields (let's say >= 1M terms) rebuilding the OrdinalMap can take some time to complete. This can then impact the NRT aspect of many applications (facets may rely on ordinal maps being rebuilt before a new search can happen after the reopen).
> I'd like to explore a different OrdinalMap implementation that doesn't need to be rebuilt on each reopen. There are simple improvements that can be made:
> * Let's say docs have only been marked as deleted; then we can basically reuse the OrdinalMap that has already been built.
> * If no new terms have been introduced we can just add segment ordinal to global ordinal lookups to the OrdinalMap that has already been built.
> I think a complete OrdinalMap rebuild is inevitable, but it would be great if we could rebuild on a flush / merge instead of on each reopen.


