Posted to issues@ignite.apache.org by "Ivan Bessonov (Jira)" <ji...@apache.org> on 2021/12/10 13:04:00 UTC

[jira] [Created] (IGNITE-16102) Store all RocksDB partitions in a single column family.

Ivan Bessonov created IGNITE-16102:
--------------------------------------

Summary: Store all RocksDB partitions in a single column family.
Key: IGNITE-16102
URL: https://issues.apache.org/jira/browse/IGNITE-16102
Project: Ignite
Issue Type: Improvement
Affects Versions: 3.0.0-alpha3
Reporter: Ivan Bessonov

The current storage implementation puts each partition in its own column family. This effectively means that every partition lives in its own database, sharing only the WAL and some in-memory resources. Given that each column family has multiple files for its LSM tree, the number of open file descriptors is larger than it needs to be.

The idea is to have a single column family for all partitions within a table. We should also consider the possibility of storing several tables in the same RocksDB instance, for similar reasons. You can think of it as the equivalent of cache groups in Ignite 2.x.

There's also an "optimization" to be implemented that is missing from the code: using key hashes as prefixes.
h3. What should be implemented:

First of all, the code will be heavily refactored. This should lead to simplifications in many places.

Otherwise, I see the following list of goals to achieve:
* the current implementation allows deriving the list of partitions from the list of column families. This won't be possible anymore, so I suggest storing this list explicitly in the "meta" CF, in any format that'll be convenient during the implementation
* there should be a way of having compact "tableId" representation. IgniteUUID or even UUID is too much I think, but it might work as a basis. This problem should be discussed
* the binary representation of keys should now include the following information:
** tableId - fixed-length set of bytes to be used as a prefix
** partitionId - 2 bytes that will follow the tableId. This layout will allow making range queries for specific partitions of specific tables
** key hash - 4 bytes. This one is required to optimize comparison time for keys. Generally speaking, it's safe to assume that hashes will mostly differ for different keys, meaning that comparing hashes will usually be enough to determine that two keys are not equal
** actual key payload goes after all these prefixes
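
The key layout listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not actual Ignite code: the class and method names are hypothetical, the 4-byte tableId is a placeholder (its compact representation is still to be discussed), and Arrays.hashCode stands in for whatever key hash function ends up being used. Big-endian encoding is used so that RocksDB's default bytewise key ordering keeps all keys of one partition of one table in a contiguous range.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical sketch of the proposed key layout; names and sizes are assumptions. */
public class PartitionKeyLayout {
    /** Assumed tableId length; the issue leaves the compact representation open. */
    static final int TABLE_ID_BYTES = 4;

    /** 2 bytes for partitionId, as proposed in the issue. */
    static final int PARTITION_ID_BYTES = 2;

    /** 4 bytes for the key hash, as proposed in the issue. */
    static final int HASH_BYTES = 4;

    /** Length of the tableId + partitionId prefix used for range scans. */
    static final int PREFIX_BYTES = TABLE_ID_BYTES + PARTITION_ID_BYTES;

    /** Builds tableId | partitionId | keyHash | payload (ByteBuffer is big-endian by default). */
    static byte[] encodeKey(int tableId, int partitionId, byte[] payload) {
        return ByteBuffer.allocate(PREFIX_BYTES + HASH_BYTES + payload.length)
            .putInt(tableId)
            .putShort((short) partitionId) // keeps the low 16 bits
            .putInt(Arrays.hashCode(payload)) // placeholder hash function
            .put(payload)
            .array();
    }

    /** Builds the tableId | partitionId prefix shared by all keys of one partition. */
    static byte[] partitionPrefix(int tableId, int partitionId) {
        return ByteBuffer.allocate(PREFIX_BYTES)
            .putInt(tableId)
            .putShort((short) partitionId)
            .array();
    }

    /** Checks whether an encoded key starts with the given partition prefix. */
    static boolean belongsTo(byte[] key, byte[] prefix) {
        return key.length >= prefix.length
            && Arrays.equals(key, 0, prefix.length, prefix, 0, prefix.length);
    }
}
```

With such a layout, a scan over one partition could seek to partitionPrefix(tableId, partitionId) and iterate while belongsTo(key, prefix) holds. Two different keys in the same partition would, in most cases, already differ within the 4 hash bytes, so a bytewise comparison can stop before reaching the payload.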

--
This message was sent by Atlassian Jira
(v8.20.1#820001)