Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2014/11/26 15:15:12 UTC

[jira] [Updated] (LUCENE-6077) Add a filter cache

     [ https://issues.apache.org/jira/browse/LUCENE-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-6077:
---------------------------------
    Attachment: LUCENE-6077.patch

Here is a patch. It divides the work into two pieces:
 - FilterCache, whose responsibility is to act as a per-segment cache for filters. It does not decide which filters should be cached.
 - FilterCachingPolicy, whose responsibility is to decide whether a filter is worth caching given the filter itself, the current segment and the produced (uncached) DocIdSet.
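To illustrate this split, here is a minimal Java sketch. It is not code from the patch: SketchFilterCache, SketchCachingPolicy and their methods are hypothetical names, and generic type parameters stand in for Lucene's filter, segment and DocIdSet types. The cache only stores and looks up results per segment, and asks the policy it is given whether a freshly computed result should be kept.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical policy: decides whether a computed result is worth caching. */
interface SketchCachingPolicy<F, S, R> {
    /** Called on every use of a filter, so a policy can track usage if it wants to. */
    default void onUse(F filter) {}

    /** Decision based on the filter, the segment and the uncached result. */
    boolean shouldCache(F filter, S segment, R uncachedResult);
}

/** Hypothetical per-segment cache: stores results, never decides what to cache. */
final class SketchFilterCache<F, S, R> {
    private final Map<S, Map<F, R>> perSegment = new HashMap<>();

    /** Returns the cached result if present, otherwise computes it and asks the policy. */
    R get(F filter, S segment, SketchCachingPolicy<F, S, R> policy, Supplier<R> compute) {
        policy.onUse(filter);
        Map<F, R> cached = perSegment.get(segment);
        if (cached != null && cached.containsKey(filter)) {
            return cached.get(filter);
        }
        R result = compute.get();
        if (policy.shouldCache(filter, segment, result)) {
            perSegment.computeIfAbsent(segment, s -> new HashMap<>()).put(filter, result);
        }
        return result;
    }

    boolean isCached(F filter, S segment) {
        Map<F, R> cached = perSegment.get(segment);
        return cached != null && cached.containsKey(filter);
    }
}
```

Because the policy is passed on each lookup in this sketch, different filters can use different policies while sharing one cache instance.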

FilterCache has an implementation called LRUFilterCache that accepts a maximum size (number of cached filters) and a maximum RAM usage, and evicts least-recently-used filters first. It has some protected methods that let you configure which implementation is used to cache DocIdSets (RoaringDocIdSet by default) and how the RAM usage of filters is measured (the default implementation uses Accountable#ramBytesUsed if the filter implements Accountable, and falls back to an arbitrary constant (1024) otherwise).
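The eviction behaviour described above can be sketched in plain Java as follows. This is an illustration under assumptions, not the patch's LRUFilterCache: LruSketchCache and SketchAccountable are hypothetical names, the sketch relies on an access-ordered java.util.LinkedHashMap, and it only charges the key (the filter) for RAM, ignoring the cached value.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an interface that reports memory usage. */
interface SketchAccountable {
    long ramBytesUsed();
}

/** LRU cache bounded by both entry count and estimated RAM usage. */
class LruSketchCache<K, V> {
    /** Arbitrary fallback cost for keys that cannot report their size. */
    static final long DEFAULT_RAM_BYTES = 1024;

    private final int maxSize;
    private final long maxRamBytes;
    private long ramBytes;
    // accessOrder = true: iteration goes from least to most recently used.
    private final LinkedHashMap<K, V> entries = new LinkedHashMap<>(16, 0.75f, true);

    LruSketchCache(int maxSize, long maxRamBytes) {
        this.maxSize = maxSize;
        this.maxRamBytes = maxRamBytes;
    }

    /** Overridable measurement hook; this sketch only measures the key. */
    protected long ramBytesUsed(K filter) {
        return filter instanceof SketchAccountable
            ? ((SketchAccountable) filter).ramBytesUsed()
            : DEFAULT_RAM_BYTES;
    }

    synchronized V get(K key) {
        return entries.get(key); // also marks the entry as most recently used
    }

    synchronized void put(K key, V value) {
        if (!entries.containsKey(key)) {
            ramBytes += ramBytesUsed(key);
        }
        entries.put(key, value);
        // Evict least-recently-used entries until both limits are respected.
        Iterator<Map.Entry<K, V>> it = entries.entrySet().iterator();
        while ((entries.size() > maxSize || ramBytes > maxRamBytes) && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            ramBytes -= ramBytesUsed(eldest.getKey());
            it.remove();
        }
    }

    synchronized int size() {
        return entries.size();
    }

    synchronized long totalRamBytes() {
        return ramBytes;
    }
}
```

A subclass could override ramBytesUsed to plug in a different size estimate, which mirrors the role of the protected methods mentioned above.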

FilterCachingPolicy has an implementation called UsageTrackingFilterCachingPolicy that tries to provide sensible defaults:
 - it tracks the 256 most recently used filters (through their hash codes) globally (not per segment)
 - it only caches on segments whose source is a merge or addIndexes (not flushes)
 - it uses some heuristics to decide how many times a filter should appear in the history of 256 filters in order to be cached.
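A rough Java sketch of such a policy is shown below. It is not the patch's UsageTrackingFilterCachingPolicy: the class name, the SegmentSource enum and the single fixed minFrequency threshold are assumptions made for illustration, whereas the patch uses heuristics to pick the required number of occurrences. The sketch keeps a ring buffer of the hash codes of recently used filters, shared across segments, and refuses to cache on flushed segments.

```java
/** Hypothetical usage-tracking policy based on a ring buffer of hash codes. */
final class UsageTrackingSketchPolicy {
    /** Assumed description of where a segment came from. */
    enum SegmentSource { FLUSH, MERGE, ADD_INDEXES }

    private final int[] history; // hash codes of the most recently used filters
    private int next;            // next slot to overwrite
    private int filled;          // number of valid slots
    private final int minFrequency;

    UsageTrackingSketchPolicy(int historySize, int minFrequency) {
        this.history = new int[historySize];
        this.minFrequency = minFrequency;
    }

    /** Records one use of the filter, overwriting the oldest entry when full. */
    synchronized void onUse(Object filter) {
        history[next] = filter.hashCode();
        next = (next + 1) % history.length;
        if (filled < history.length) {
            filled++;
        }
    }

    /** Counts how often the filter's hash code appears in the recent history. */
    synchronized int frequency(Object filter) {
        int hash = filter.hashCode();
        int count = 0;
        for (int i = 0; i < filled; i++) {
            if (history[i] == hash) {
                count++;
            }
        }
        return count;
    }

    /** Caches only on merged or added segments, and only for frequent filters. */
    synchronized boolean shouldCache(Object filter, SegmentSource source) {
        if (source == SegmentSource.FLUSH) {
            return false; // flushed segments are likely to be merged away soon
        }
        return frequency(filter) >= minFrequency;
    }
}
```

Constructing it as new UsageTrackingSketchPolicy(256, n) would match the history size mentioned above, with n left as a tuning choice. Since only hash codes are stored, two distinct filters with colliding hash codes would be counted together.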

The filter caching policy can be configured on a per-filter basis, so that even if there are filters that you want to cache more aggressively than others, it is possible to cache them all in a single FilterCache instance.

> Add a filter cache
> ------------------
>
>                 Key: LUCENE-6077
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6077
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: LUCENE-6077.patch
>
>
> Lucene already has filter caching abilities through CachingWrapperFilter, but CachingWrapperFilter requires you to know which filters you want to cache up-front.
> Caching filters is not trivial. If you cache too aggressively, then you slow things down since you need to iterate over all documents that match the filter in order to load it into an in-memory cacheable DocIdSet. On the other hand, if you don't cache at all, you are potentially missing interesting speed-ups on frequently-used filters.
> It would be nice to have a generic filter cache that tracks usage for individual filters and decides whether or not to cache a filter on a given segment based on usage statistics and various heuristics, such as:
>  - the overhead to cache the filter (for instance some filters produce DocIdSets that are already cacheable)
>  - the cost to build the DocIdSet (the getDocIdSet method is very expensive on some filters, such as MultiTermQueryWrapperFilter, which potentially needs to merge lots of postings lists)
>  - the segment we are searching on (flush segments will likely be merged right away so it's probably not worth building a cache on such segments)


