Posted to issues@ignite.apache.org by "Ivan Bessonov (Jira)" <ji...@apache.org> on 2023/04/07 12:05:00 UTC

[jira] [Updated] (IGNITE-19133) Increase partitions count upper bound

     [ https://issues.apache.org/jira/browse/IGNITE-19133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Bessonov updated IGNITE-19133:
-----------------------------------
    Description: 
h3. Problem

Data partitioning is used to distribute data (hopefully) evenly across the cluster and to provide necessary operation parallelism for the end user. As a rule of thumb, one may consider allocating 256 partitions per Ignite node, in order to achieve that.

This rule only scales up to a certain point. Imagine a cluster of 1000 nodes, with a table that has 3 replicas of each partition (the ability to lose 1 backup). With the current limit of 65500 partitions, the maximal number of partitions per node would be {{{}65500*3/1000 ~= 196{}}}. This is the limit of our scalability, according to the aforementioned rule. To provide 256 partitions per node, the user would have to:
 * either increase the number of backups, which proportionally increases required storage space (affects cost),
 * or increase the total number of partitions up to about 85 thousand (see the arithmetic sketch below). This is not possible right now.
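To make the arithmetic explicit, here is a quick sketch (illustration only, not part of the proposal) of where the "about 85 thousand" figure comes from:
{code:java}
public class PartitionMath {
    public static void main(String[] args) {
        int nodes = 1000;
        int replicas = 3;
        int targetPerNode = 256; // rule-of-thumb partitions per node

        // partitions * replicas / nodes >= targetPerNode  =>  partitions >= targetPerNode * nodes / replicas
        long requiredPartitions = (long) Math.ceil((double) targetPerNode * nodes / replicas);

        System.out.println(requiredPartitions); // 85334, well above the current 65500 limit
    }
}
{code}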

h3. What's the reason for the current limit

Disclaimer: I'm not the one who designed it, so my thoughts may be speculative in some sense.

The short answer is: the number of partitions needs to fit into 2 bytes.

The long answer: in the current implementation we have a 1-to-1 correspondence between the logical partition id and the physical partition id. We use the same value both in affinity and in the physical file name. This makes the system simpler, and I believe that simplicity is the real explanation for the restriction.

Why does it have to be 2 bytes, and not 3, for example? The key is the structure of page identifiers in data regions:
{code:java}
+---------------------------+----------------+---------------------+-----------------+
| rotation/item id (1 byte) | flags (1 byte) | partition (2 bytes) | index (4 bytes) |
+---------------------------+----------------+---------------------+-----------------+{code}
The idea was to fit it into 8 bytes. Good idea, in my opinion. Making it bigger doesn't feel right.
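For illustration, here is a minimal packing sketch of the layout above (the exact field order and shifts are my assumptions for readability, not the actual Ignite page id utilities):
{code:java}
public class PageIdSketch {
    /** Packs rotation/item id, flags, partition and page index into a single 8-byte value. */
    static long pageId(int itemId, int flags, int partId, int pageIdx) {
        return ((long) (itemId & 0xFF) << 56)
             | ((long) (flags & 0xFF) << 48)
             | ((long) (partId & 0xFFFF) << 32) // only 2 bytes available for the partition id
             | (pageIdx & 0xFFFFFFFFL);
    }

    static int partitionId(long pageId) {
        return (int) ((pageId >>> 32) & 0xFFFF);
    }

    public static void main(String[] args) {
        long id = pageId(1, 2, 65_499, 42);
        System.out.println(partitionId(id)); // 65499 -- anything above 2 bytes simply doesn't fit
    }
}
{code}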
h3. Proposed solution

As mentioned, there are two components to the problem:
 # One to one correspondence between partition ids.
 # Hard limit in a single data region, caused by the page id layout.

There's not much we can do with component #2, because the implications are unpredictable, and the amount of code we would need to fix is astonishing.
h4. More reasonable restrictions

This leads us to the following restriction: no single Ignite node can have more than 65500 partitions for a table (or distribution zone). So, imagine the situation:
 * user has a cluster with 3 nodes
 * user tries to create a distribution zone with 3 nodes, 3 replicas for each partition and 100000 partitions

While this is absurd, the configuration is still "valid", but it leads to 100k partitions per node, which is impossible.

Such zone configurations must be banned. Such a restriction doesn't seem unreasonable: if a user wants that many partitions on such a small cluster, they really don't understand what they're doing.

This naturally gives us a minimal number of nodes for a given number of partitions, as in the following formula (assuming that you can't have 2 replicas of the same partition on the same Ignite node):
{code:java}
nodes >= max(replicas, ceil(partitions * replicas / 65500))
{code}
This estimation is imprecise, because it assumes a perfect distribution. In reality, rendezvous affinity is uneven, so the real value must be checked when the user configures the number of nodes for a specific distribution zone.
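For illustration, a minimal sketch of this lower-bound check (class and method names are hypothetical; a real check would also have to account for the uneven rendezvous distribution mentioned above):
{code:java}
public class ZoneValidationSketch {
    private static final int MAX_PARTITIONS_PER_NODE = 65_500;

    /** Lower bound on the number of nodes, assuming a perfectly even distribution. */
    static int minNodes(int partitions, int replicas) {
        int byCapacity = (int) Math.ceil((double) partitions * replicas / MAX_PARTITIONS_PER_NODE);

        // A node can't hold two replicas of the same partition, so we also need at least 'replicas' nodes.
        return Math.max(replicas, byCapacity);
    }

    public static void main(String[] args) {
        // The absurd example from above: 100000 partitions, 3 replicas.
        System.out.println(minNodes(100_000, 3)); // 5 -- a 3-node cluster must reject this zone
    }
}
{code}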
h4. Ties to rebalance

For this question I would probably need some assistance. During affinity reassignment, each node may store more partitions than stated in any single distribution. What do I mean by this:
 * imagine node having partitions 1, 2, and 3
 * after the reassignment, the node has partitions 3, 4 and 5

Each individual distribution states that the node only has 3 partitions, while during the rebalance it may store all 5: sending 1 and 2 to some node, and receiving 4 and 5 from a different one.

Multiply that by a big factor, and it is possible to have a situation where the local number of partitions exceeds 65500. The only way to beat it, in my opinion, is to lower the hard limit in the affinity function to 32xxx per node, leaving space for partitions in a MOVING state.
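A minimal sketch of that headroom idea, assuming the worst case where every stable partition may temporarily have a MOVING counterpart (the constant and names are hypothetical):
{code:java}
public class RebalanceHeadroomSketch {
    private static final int HARD_LIMIT_PER_NODE = 65_500;

    // Worst case during rebalance: every stable partition also has an incoming MOVING one,
    // so stable assignments are capped at roughly half of the hard limit (the "32xxx" above).
    private static final int SOFT_LIMIT_PER_NODE = HARD_LIMIT_PER_NODE / 2; // 32750

    static boolean fitsWithRebalance(int stablePartitions, int movingPartitions) {
        return stablePartitions <= SOFT_LIMIT_PER_NODE
            && stablePartitions + movingPartitions <= HARD_LIMIT_PER_NODE;
    }

    public static void main(String[] args) {
        System.out.println(fitsWithRebalance(32_750, 32_750)); // true -- exactly at the hard limit
        System.out.println(fitsWithRebalance(40_000, 0));      // false -- no headroom left for MOVING
    }
}
{code}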
h4. Mapping partition ids

With that being said, all that's left is to map logical partition ids from the range 0..N (where N is unlimited) to physical ids from the range 0..65500.

Such a mapping is a local entity, encapsulated deep inside the storage engine. The simplest way to do so is to have a HashMap \{ logical -> physical } and to increase the physical partition id by 1 every time a new value is inserted. If the {{values()}} set is not contiguous, one may occupy the gaps; it's not too hard to implement.

Of course, this correspondence must be persisted with the data. When we read a physical partition from the storage, we must be able to determine its logical id.
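A minimal in-memory sketch of such a mapping (all names are hypothetical; persistence of the map is omitted, and gap reuse is simplified to taking the lowest free physical id):
{code:java}
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Maps unbounded logical partition ids to physical ids in the range [0, 65500). */
public class PartitionIdMapper {
    private static final int MAX_PHYSICAL_PARTITIONS = 65_500;

    private final Map<Integer, Integer> logicalToPhysical = new HashMap<>();
    private final BitSet usedPhysicalIds = new BitSet(MAX_PHYSICAL_PARTITIONS);

    /** Returns the physical id for the given logical id, allocating one (reusing gaps) if needed. */
    public synchronized int physicalId(int logicalId) {
        return logicalToPhysical.computeIfAbsent(logicalId, ignored -> {
            int physical = usedPhysicalIds.nextClearBit(0);

            if (physical >= MAX_PHYSICAL_PARTITIONS)
                throw new IllegalStateException("No free physical partition ids on this node");

            usedPhysicalIds.set(physical);

            return physical;
        });
    }

    /** Frees the physical id when the local partition is destroyed, leaving a gap to reuse. */
    public synchronized void remove(int logicalId) {
        Integer physical = logicalToPhysical.remove(logicalId);

        if (physical != null)
            usedPhysicalIds.clear(physical);
    }
}
{code}
When reading a physical partition from storage, the reverse direction of this map (persisted together with the partition data) would give back the logical id.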


> Increase partitions count upper bound
> -------------------------------------
>
>                 Key: IGNITE-19133
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19133
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)