Posted to issues@ozone.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2020/09/15 10:46:00 UTC

[jira] [Created] (HDDS-4246) Consider increasing shared RocksDB LRU cache size on datanodes

Stephen O'Donnell created HDDS-4246:
---------------------------------------

             Summary: Consider increasing shared RocksDB LRU cache size on datanodes
                 Key: HDDS-4246
                 URL: https://issues.apache.org/jira/browse/HDDS-4246
             Project: Hadoop Distributed Data Store
          Issue Type: Improvement
          Components: Ozone Datanode
    Affects Versions: 1.1.0
            Reporter: Stephen O'Donnell


By default, when a RocksDB instance is opened, an 8MB LRU cache is associated with the instance. According to the RocksDB manual, many instances in the same process can share the same LRU cache:

https://github.com/facebook/rocksdb/wiki/Block-Cache

{quote}
A Cache object can be shared by multiple RocksDB instances in the same process, allowing users to control the overall cache capacity.
{quote}

This is of particular interest on the datanodes, where there are potentially thousands of small RocksDB instances.

A RocksDB PR added a feature to the Java implementation that allows an LRU cache to be explicitly created and passed to different "Options" objects, ensuring the same cache is reused:

{code}
Cache cache = new LRUCache(64 * SizeUnit.MB);
BlockBasedTableConfig table_options = new BlockBasedTableConfig();
table_options.setBlockCache(cache);
Options options = new Options();
options.setCreateIfMissing(true)
    .setStatistics(stats)
    .setTableFormatConfig(table_options);
...
{code}

Before this feature, the way to reuse a cache across many DB instances was to pass the exact same RocksDB Options object when creating each RocksDB instance. This means that a possible unintended side effect of HDDS-2283 (which caches the RocksDB options and reuses them across all DB containers) is that there is now only a single 8MB RocksDB cache shared across all the container RocksDBs on the datanode.

You can confirm this by grepping the RocksDB LOG file. For example, with Options caching in place, in two containers:

{code}
bash-4.2$ grep -A5 "block_cache:" ./hdds/hdds/2ad8eea5-b9e1-41e1-85eb-8cae745efcb6/current/containerDir0/2/metadata/2-dn-container.db/LOG
  no_block_cache: 0
  block_cache: 0x563ba9088bb0    <=====
  block_cache_name: LRUCache
  block_cache_options:
    capacity : 8388608
    num_shard_bits : 4
    strict_capacity_limit : 0
bash-4.2$ grep -A5 "block_cache:" ./hdds/hdds/2ad8eea5-b9e1-41e1-85eb-8cae745efcb6/current/containerDir0/3/metadata/3-dn-container.db/LOG
  no_block_cache: 0
  block_cache: 0x563ba9088bb0   <=====
  block_cache_name: LRUCache
  block_cache_options:
    capacity : 8388608
    num_shard_bits : 4
    strict_capacity_limit : 0
{code}

Note the block cache in both containers shares the same address "0x563ba9088bb0".
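
The same check could be automated across many containers. The following is only a minimal sketch, assuming the LOG lines have the format shown in the grep output above; the class and method names (BlockCacheAddresses, findAddress, groupByAddress) are hypothetical and not part of Ozone or RocksDB. It extracts the "block_cache:" address from each container's LOG lines and groups containers by address, so containers sharing one cache end up in the same group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Ozone or RocksDB.
final class BlockCacheAddresses {
    // Matches "  block_cache: 0x563ba9088bb0" but not "no_block_cache: 0"
    // or "block_cache_name: LRUCache".
    private static final Pattern BLOCK_CACHE =
        Pattern.compile("^\\s*block_cache:\\s*(0x[0-9a-fA-F]+)");

    /** Returns the first block cache address found in the lines, or null if none. */
    static String findAddress(List<String> logLines) {
        for (String line : logLines) {
            Matcher m = BLOCK_CACHE.matcher(line);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    /** Groups container labels by the block cache address found in each LOG. */
    static Map<String, List<String>> groupByAddress(Map<String, List<String>> logsByContainer) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : logsByContainer.entrySet()) {
            String address = findAddress(e.getValue());
            if (address != null) {
                groups.computeIfAbsent(address, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample lines copied from the grep output above; the keys are just labels.
        Map<String, List<String>> logs = new LinkedHashMap<>();
        logs.put("2-dn-container.db", Arrays.asList(
            "  no_block_cache: 0",
            "  block_cache: 0x563ba9088bb0    <=====",
            "  block_cache_name: LRUCache"));
        logs.put("3-dn-container.db", Arrays.asList(
            "  no_block_cache: 0",
            "  block_cache: 0x563ba9088bb0   <=====",
            "  block_cache_name: LRUCache"));
        for (Map.Entry<String, List<String>> e : groupByAddress(logs).entrySet()) {
            System.out.println(e.getKey() + " shared by " + e.getValue());
        }
    }
}
```

A group containing more than one container indicates that those DB instances report the same cache object. Since the addresses are process-local pointers, comparing them is only meaningful for LOG files written by the same datanode process.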

After reverting the caching change, so that a new Options object is passed to each RocksDB instance, the cache addresses differ:

{code}
bash-4.2$ grep -A5 "block_cache:" ./hdds/hdds/ac115132-9693-4ab9-9d73-dd4bf7e40caf/current/containerDir0/4/metadata/4-dn-container.db/LOG
  no_block_cache: 0
  block_cache: 0x7feec0b86270   <=====
  block_cache_name: LRUCache
  block_cache_options:
    capacity : 8388608
    num_shard_bits : 4
    strict_capacity_limit : 0
bash-4.2$ grep -A5 "block_cache:" ./hdds/hdds/ac115132-9693-4ab9-9d73-dd4bf7e40caf/current/containerDir0/1/metadata/1-dn-container.db/LOG
  no_block_cache: 0
  block_cache: 0x565360926f70   <=====
  block_cache_name: LRUCache
  block_cache_options:
    capacity : 8388608
    num_shard_bits : 4
    strict_capacity_limit : 0
{code}

Given this, a single 8MB cache for all containers on a large node is very likely insufficient. We should consider whether it makes sense to set a larger shared cache size on the DN, or to have several shared caches.
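
To give a rough sense of scale, the sketch below divides the 8388608-byte capacity reported in the LOG evenly across a number of containers. This is a simplification: an LRU cache is not partitioned evenly between instances, and the container counts used here are purely illustrative assumptions, not measurements. The class name SharedCacheShare is hypothetical.

```java
// Hypothetical back-of-the-envelope helper, not part of Ozone or RocksDB.
final class SharedCacheShare {
    // Capacity reported in the LOG output: 8388608 bytes (8MB).
    static final long DEFAULT_CACHE_BYTES = 8L * 1024 * 1024;

    /** Average cache bytes per container if the cache were split evenly. */
    static long bytesPerContainer(long cacheBytes, int containers) {
        if (containers <= 0) {
            throw new IllegalArgumentException("containers must be positive");
        }
        return cacheBytes / containers;
    }

    public static void main(String[] args) {
        // Illustrative container counts only (assumptions, not measured values).
        int[] containerCounts = {1, 100, 1000, 5000};
        for (int count : containerCounts) {
            System.out.println(count + " containers -> "
                + bytesPerContainer(DEFAULT_CACHE_BYTES, count) + " bytes each on average");
        }
    }
}
```

With 1000 containers, the even split works out to 8388 bytes per container, which is only a few kilobytes each.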

Note that I have not seen any performance issues caused by this; I came across it while investigating RocksDB in general.



