Posted to issues@lucene.apache.org by "Andrzej Bialecki (Jira)" <ji...@apache.org> on 2019/10/02 15:26:00 UTC

[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching

    [ https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942904#comment-16942904 ] 

Andrzej Bialecki commented on SOLR-13790:
-----------------------------------------

On further examination, it looks like {{ExactSharedStatsCache}} and {{LRUStatsCache}} have a staleness problem: they don't track updates in the shards, so they have no way of knowing when to refresh the stats. As a result the global stats may be even more wrong than if we used only local stats. Imagine a scenario with heavy indexing activity that adds a lot of terms and postings. Stats from the local shard would reflect this growth, albeit partially, but the stale global stats would not.
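
To make the staleness argument concrete, here is a minimal sketch that compares the IDF a term would get from stale global stats, from fresh global stats, and from local stats only. The formula is the BM25 IDF used by Lucene's BM25Similarity; all the document counts are made-up numbers chosen for illustration, and the class name {{StaleIdfSketch}} is hypothetical.

```java
public class StaleIdfSketch {
  /** BM25 IDF: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
  static double idf(long docFreq, long docCount) {
    return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  }

  public static void main(String[] args) {
    // Made-up numbers: global stats captured before a burst of indexing.
    double stale = idf(10, 1_000);
    // Made-up numbers: true global stats after the burst (the term became common).
    double fresh = idf(5_000, 10_000);
    // Made-up numbers: one of two equally sized shards, seen locally after the burst.
    double local = idf(2_500, 5_000);

    System.out.println("stale global idf = " + stale);
    System.out.println("fresh global idf = " + fresh);
    System.out.println("local-only idf   = " + local);
  }
}
```

With these particular numbers the local-only value lands much closer to the fresh global value than the stale global value does, which is the situation described above. Other distributions of terms across shards would of course give different results.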

Another issue is the purported optimization in {{LRUStatsCache}} and {{ExactSharedStatsCache}}: the claimed advantage of these caches is that they avoid unnecessary fetching of stats from shards. Only they don't... As explained in my previous comment, both implementations always send ShardRequest-s to fetch the stats, adding one more round-trip to every query. Since the stats were fetched on every request, at least there was no problem with staleness ;) but the "caching" aspect was completely false: per-shard stats were fetched on every request, and on every request new global stats were built and sent out.

I plan to address these issues separately, since the current patch is already large.

Updated patch with the following additional changes:
 * the biggest change is that StatsCache instances are now tied to SolrIndexSearcher and its life-cycle rather than to SolrCore - this at least mitigates the staleness problem and also the unbounded memory consumption of {{ExactSharedStatsCache}}. The downside is that the cache needs to be re-populated after every commit.
 * more optimization and safety in StatsUtil serialization code
 * fixed a bug in {{DebugComponent}} where only local stats would be used for explanations - this threw me off for a while, as I relied on explanations to explain the details of scoring :)
 * added more substance to SolrCloud unit tests
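
The first change in the list above can be pictured with a minimal sketch. This is not the patch itself: {{Searcher}}, {{Core}} and {{StatsCacheSketch}} are hypothetical stand-ins for SolrIndexSearcher, SolrCore and a StatsCache implementation, and they only illustrate why scoping the cache to the searcher bounds both staleness and memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SearcherScopedStats {
  /** Hypothetical stand-in for a StatsCache: term -> global document frequency. */
  static final class StatsCacheSketch {
    private final Map<String, Long> globalDocFreq = new ConcurrentHashMap<>();
    void put(String term, long df) { globalDocFreq.put(term, df); }
    Long get(String term) { return globalDocFreq.get(term); }
    int size() { return globalDocFreq.size(); }
  }

  /** Hypothetical stand-in for a searcher: it owns its cache for its whole life-cycle. */
  static final class Searcher {
    final StatsCacheSketch statsCache = new StatsCacheSketch();
  }

  /** Hypothetical stand-in for a core: a commit opens a new searcher. */
  static final class Core {
    private volatile Searcher current = new Searcher();
    Searcher searcher() { return current; }
    // The old searcher, and its cache with it, becomes garbage once it is unreferenced.
    void commit() { current = new Searcher(); }
  }

  public static void main(String[] args) {
    Core core = new Core();
    core.searcher().statsCache.put("solr", 42L);
    System.out.println(core.searcher().statsCache.get("solr")); // served from the cache
    core.commit();
    System.out.println(core.searcher().statsCache.get("solr")); // empty, must be re-populated
  }
}
```

Cached entries can never outlive the index view they were computed for, at the cost of a cold cache after each commit.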

All tests are passing. If there are no objections I'd like to commit this shortly.

> LRUStatsCache size explosion and ineffective caching
> ----------------------------------------------------
>
>                 Key: SOLR-13790
>                 URL: https://issues.apache.org/jira/browse/SOLR-13790
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.7.2, 8.2, 8.3
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Critical
>             Fix For: 7.7.3, 8.3
>
>         Attachments: SOLR-13790.patch, SOLR-13790.patch
>
>
> On a sizeable cluster with multi-shard multi-replica collections, when {{LRUStatsCache}} was in use we encountered excessive memory usage, which consequently led to severe performance problems.
> On closer examination of the heap dumps it became apparent that when {{LRUStatsCache.addToPerShardTermStats}} is called it creates instances of {{FastLRUCache}} using the passed {{shard}} argument - however, the value of this argument is not a simple shard name but a randomly ordered list of ALL replica URLs for this shard.
> As a result, due to the combinatorial number of possible keys, over time the map in {{LRUStatsCache.perShardTermStats}} grew to contain ~2 million entries...
> The fix seems to be simply to extract the shard name and cache using this name instead of the full string value of the {{shard}} parameter. Existing unit tests also need much improvement.
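
The key explosion described in the issue can be reproduced in a few lines. The sketch below assumes the replica URLs in the {{shard}} value are joined with {{|}}; the URLs and the class name {{ShardKeySketch}} are made up. Note that it canonicalizes the key by sorting the URLs, which is a swapped-in technique chosen because it needs no assumption about how the shard name is encoded. The issue itself proposes caching by the extracted shard name instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ShardKeySketch {
  /**
   * Hypothetical normalization: sort the replica URLs so that every ordering
   * of the same replicas maps to a single cache key.
   */
  static String canonicalKey(String shard) {
    String[] replicas = shard.split("\\|");
    Arrays.sort(replicas);
    return String.join("|", replicas);
  }

  public static void main(String[] args) {
    // Made-up replica URLs for one shard.
    String a = "http://host1:8983/solr/c_shard1_replica_n1/";
    String b = "http://host2:8983/solr/c_shard1_replica_n2/";
    String c = "http://host3:8983/solr/c_shard1_replica_n3/";

    // Three of the 3! = 6 orderings in which the same shard may be reported.
    String[] seen = { a + "|" + b + "|" + c, b + "|" + c + "|" + a, c + "|" + a + "|" + b };

    Set<String> rawKeys = new HashSet<>(Arrays.asList(seen));
    Set<String> canonicalKeys = new HashSet<>();
    for (String s : seen) {
      canonicalKeys.add(canonicalKey(s));
    }
    System.out.println("raw keys: " + rawKeys.size() + ", canonical keys: " + canonicalKeys.size());
  }
}
```

Using the raw string as the key, each new ordering creates a new per-shard cache, so the map grows with the number of orderings rather than with the number of shards.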


