Posted to commits@kudu.apache.org by gr...@apache.org on 2019/04/12 21:51:11 UTC

[kudu] 02/04: [docs] add design doc for location awareness

This is an automated email from the ASF dual-hosted git repository.

granthenke pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit be8fcc6879ed23c79e562a15e46310b1e9522432
Author: Alexey Serbin <al...@apache.org>
AuthorDate: Wed Apr 10 15:19:23 2019 -0700

    [docs] add design doc for location awareness
    
    Added the design document for the location awareness feature.  I used
    the pandoc tool (https://pandoc.org/) to convert it from HTML after
    exporting it from the original Google Docs document.
    I manually removed comments and extra anchors.
    
    At the time of writing, the original was available at:
    
      https://s.apache.org/location-awareness-design
    
    Change-Id: Ibe913f19520024c43fbb8184cb8dd1f1e679ddec
    Reviewed-on: http://gerrit.cloudera.org:8080/13000
    Reviewed-by: Grant Henke <gr...@apache.org>
    Tested-by: Alexey Serbin <as...@cloudera.com>
---
 docs/design-docs/location-awareness.md | 386 +++++++++++++++++++++++++++++++++
 1 file changed, 386 insertions(+)

diff --git a/docs/design-docs/location-awareness.md b/docs/design-docs/location-awareness.md
new file mode 100644
index 0000000..58bb688
--- /dev/null
+++ b/docs/design-docs/location-awareness.md
@@ -0,0 +1,386 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+Kudu Location Awareness
+=======================
+
+Requirements
+============
+
+Motivation
+----------
+
+Kudu is designed to tolerate independent failures of single nodes.
+Tablets are normally replicated at least 3 times via Raft consensus,
+remain available (possibly with a short interruption) when any single
+replica becomes unavailable, and automatically re-replicate data if a
+replica is unavailable for too long. However, there can be correlated
+failures of multiple nodes when e.g. an entire rack loses power or an
+AWS availability zone has its fiber severed. Kudu has matured to the
+point where users require features that allow them to prepare for
+correlated node failures.
+
+Use Cases
+---------
+
+Below are a few use cases involving Kudu location awareness policies.
+The motivation is to clarify the big picture and build a high-level
+proposal, so we can be on the same page in terms of our expectations.
+Hopefully, having these will help us iterate on the set of requirements
+for the MVP that supports location awareness policies.
+
+### 1 Required: Rack-aware placement of replicas to survive per-rack hardware failures
+
+Having servers in multiple racks, place a tablet's data across racks so
+that if a network switch or power supply on one rack fails, the tablet's
+data is still available for reading and writing.
+
+### 2 Required: Keeping up-to-date inter-datacenter backup replicas
+
+The goal is to avoid losing data when all hardware in a single
+datacenter/zone fails. Running servers in multiple datacenters/zones,
+place data in different datacenters/zones so that if all servers in one
+datacenter/zone fail, the tablet's data is still available for reading
+and writing. For this use case, it's assumed that the latencies between
+datacenters are negligible.
+
+### 3 Strict locality constraints
+
+Having servers in multiple datacenters/zones, place a table's data
+across servers only in a particular datacenter/zone (something like a
+GDPR-style constraint), not allowing replicas to be placed in any other
+zone/datacenter even during rebalancing.
+
+### 4 Strict inter-datacenter locality constraints and intra-datacenter rack-awareness
+
+In a multi-datacenter or multi-zone deployment, it’s necessary to have
+affinity for some tables (e.g., pin them to a particular DC or AZ). But
+within that DC, all tablets of those tables should remain available if a
+single rack fails. Basically, that’s a superposition of cases 1 and 3.
+
+Requirements
+------------
+
+-   Allow administrators to label nodes with locations.
+
+-   Support location awareness for clients as well, if labeled.
+
+-   Support location-aware replica placement policies that allow
+    administrators to control where replicas of a table are placed based
+    on a policy set for the cluster as a whole. The supported policies
+    must be at least expressive enough to handle the following use case
+    (but may be more expressive):
+
+    -   Ensure that no location contains a majority of the replicas if
+        there are more than 2 locations.
+
+-   Support tablets with replicas split between locations that have
+    reliable, low-latency connectivity between them, e.g. between AWS
+    availability zones in the same region.
+
+-   Diagnostics and warnings should be able to indicate when the
+    connection between locations is not reliable or has high latency.
+
+-   Support location-aware clients so that, at a minimum, a
+    location-aware client performing a READ\_CLOSEST scan will read from
+    the same location if possible.
+-   Enhance the rebalancing tool to preserve placement policies if they
+    hold and to move replicas to enforce placement policies if they are
+    violated.
+-   Provide other appropriate guides and tools to administer and support
+    the features needed to fulfill the above requirements:
+
+    -   Documentation on setting up location awareness, covering at
+        least how to achieve availability in the face of a single
+        location failure.
+    -   CLI tools to observe location labels and policies.
+
+Non-requirements
+----------------
+
+-   Support tablets with replicas distributed globally, or to locations
+    that do not have high-quality, low-latency connections.
+-   Support a distinguished “location-aware durability” case where a
+    tablet is durable but not available if a single location is
+    permanently lost. This can be achieved by splitting replicas evenly
+    across two locations with an even replication factor, for example.
+-   Support location-awareness for tablet leadership. For example,
+    require that 2 out of 3 replicas be placed in location “us-east-1”
+    and the tablet leader be one of the two replicas in that location.
+-   Support performance- or regulation-motivated placement policies at
+    higher granularity.
+
+    -   E.g. “place all replicas of tablets for range partition
+        ‘EUROPE-PII’ in location ‘eu-1’”.
+
+-   Optimize the total data transfer between replicas in order to
+    minimize the potential costs of splitting tablets across locations,
+    e.g. AWS charging for transfer between availability zones.
+
+    -   For example, optimizing tablet copy to copy from replicas in the
+        same location as the destination replica as opposed to copying
+        from the leader.
+
+Kudu Location Awareness - Design
+================================
+
+Proposal: “Hew to HDFS”
+-----------------------
+
+This proposal fulfills the requirements with a design that resembles
+HDFS’s rack awareness features as much as possible. Because Kudu tablets
+require a majority of replicas to be available while HDFS blocks require
+only one, there are significant differences.
+
+### Node labeling
+
+Kudu will adopt HDFS’s method of node labeling. The master will gain a
+new flag \`--location\_mapping\_cmd\` whose value will be an executable
+script. The script should accept one or more hostnames or IP addresses and
+return, for each hostname or IP address provided, a location. A
+location is a /-separated string that begins with ‘/’, like a Unix path,
+and whose characters in the /-separated components are limited to those
+from the set [a-zA-Z0-9\_-.].
+
+The /-separation signals that hierarchical semantics may be introduced
+later.
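+
+For illustration only, a minimal mapping script might look like the
+following sketch; the hostname-to-location table is hypothetical, and
+any executable honoring the same input/output contract would work:
+
+```python
+#!/usr/bin/env python
+# Hypothetical location mapping script: print one location per input
+# hostname/IP, in the same order as the arguments.
+import sys
+
+# Assumed static assignments; a real deployment might consult a config
+# file or an inventory service instead.
+LOCATION_BY_HOST = {
+    "tserver-0.example.com": "/dc-west/rack-1",
+    "tserver-1.example.com": "/dc-west/rack-2",
+}
+DEFAULT_LOCATION = "/default-rack"
+
+for host in sys.argv[1:]:
+    print(LOCATION_BY_HOST.get(host, DEFAULT_LOCATION))
+```
+
+The master would then point \`--location\_mapping\_cmd\` at this
+executable.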
+
+Whenever a tablet server registers or re-registers with the master, the
+master will assign it a location by calling the script with the hostname
+or IP of the server. The assigned location will be cached, and the
+cached location will be used until the tablet server re-registers, at
+which point the master will re-resolve the tablet server’s location
+using the script. The leader master will use location information to
+place new replicas, both when creating a table and when re-replicating.
+
+### Placement policies
+
+Recall the one tablet replica placement policy required of any design:
+
+-   If there are more than 2 locations, ensure that no location contains
+    a majority of the replicas.
+
+This design proposes a placement policy for tablet creation and
+re-replication that makes a best effort to comply with this requirement
+but will violate it in order to keep tablets fully replicated. This is
+the same enforcement approach as HDFS takes.
+
+Additionally, this design has one more policy that it makes a best
+effort to satisfy:
+
+-   Ensure that, when there are two locations, no more than floor(r / 2)
+    + 1 replicas of a tablet with replication factor r are placed in a
+    single location.
+
+These two policies will apply to every table in the cluster and will not
+be configurable. Future implementations may introduce configurable
+placement policies for the cluster or sub-elements of the cluster like
+tables, or may introduce different strictness semantics or guarantees
+around enforcement.
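+
+As an illustration of these two policies, the following sketch
+(hypothetical helper, not actual Kudu code) checks whether a given
+per-location replica distribution for one tablet complies with them:
+
+```python
+def complies_with_policy(replicas_per_location, replication_factor):
+    """Check the placement policies described above for one tablet.
+
+    replicas_per_location: dict mapping each cluster location to the
+    number of this tablet's replicas it hosts (0 if none).
+    """
+    num_locations = len(replicas_per_location)
+    if num_locations > 2:
+        # Required policy: no location may hold a majority of the replicas.
+        majority = replication_factor // 2 + 1
+        return all(n < majority for n in replicas_per_location.values())
+    if num_locations == 2:
+        # Best-effort policy: at most floor(r / 2) + 1 replicas per location.
+        cap = replication_factor // 2 + 1
+        return all(n <= cap for n in replicas_per_location.values())
+    return True  # With a single location, neither policy applies.
+
+# 3 replicas spread over 3 locations complies with the policy...
+assert complies_with_policy({"/r1": 1, "/r2": 1, "/r3": 1}, 3)
+# ...while 2 of 3 replicas in one of 3 locations is a majority.
+assert not complies_with_policy({"/r1": 2, "/r2": 1, "/r3": 0}, 3)
+```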
+
+#### Definitions
+
+-   A location is available if at least one tablet server in the
+    location is available, according to the master.
+-   If the number of tablet servers in a location is N and the total
+    number of replicas in the location is R, the load or total load of
+    the location is defined as R / N. For a table T with R replicas in
+    the location, the table load or load for table T is defined as R /
+    N.
+-   The skew between two locations is defined as the absolute value of
+    the difference between their loads. Table skew between locations is
+    defined analogously.
+-   The skew of a set of locations is defined as the maximum skew
+    between any pair of locations. The table skew of a set of locations
+    is defined analogously.
+-   A set of locations is balanced if no replica can be moved between
+    locations without either violating the placement policy or
+    increasing the skew. Table-wise balance is defined analogously.
+
+Note that the definitions of skew and balance generalize the definitions
+used in the original rebalancer design, if we identify a location
+consisting of a single tablet server with its tablet server.
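+
+To make the definitions concrete, here is a small sketch (names are
+hypothetical) of computing location load and skew:
+
+```python
+def location_load(num_replicas, num_tservers):
+    # Load of a location: replicas hosted there divided by the number
+    # of tablet servers in the location.
+    return num_replicas / num_tservers
+
+def skew(loads):
+    # Skew of a set of locations: the maximum difference in load
+    # between any pair of locations.
+    return max(loads) - min(loads) if loads else 0.0
+
+# Example reused in the rebalancing discussion below: locations A, B, C
+# with 100, 2, and 2 tablet servers hosting 2, 2, and 1 replicas.
+loads = [location_load(2, 100), location_load(2, 2), location_load(1, 2)]
+print(skew(loads))  # 1.0 - 0.02 = 0.98
+```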
+
+#### Tablet Creation
+
+1.  On tablet creation, the master will choose locations for the
+    replicas by repeatedly using the power of two choices, starting from
+    a pool of all available locations. The load factor used to make the
+    binary choice in the algorithm will be the load of the location.
+    When a location is picked it is removed from the pool. If there are
+    still replicas remaining after all locations have been used, all
+    available locations will be put back into the pool except those
+    which cannot possibly accept a replica because every tablet server
+    in the location already hosts a replica.
+2.  Within a location, replicas will be placed using the power of two
+    choices algorithm among the tablet servers.
+
+If it is possible to conform to the placement policies given the number
+of tablet servers in each location, then this policy clearly produces an
+arrangement that complies with the policy.
+
+The new placement algorithm generalizes the original by splitting it
+into two stages: choosing locations, then choosing replicas within each
+location. The original algorithm can be recovered by considering all
+tablet servers to be in a single location. This algorithm also
+generalizes in an obvious way to more levels of location hierarchy.
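+
+A rough sketch of stage 1 (choosing locations), assuming the master
+tracks, per location, the number of tablet servers and the current
+replica count; the helper names are hypothetical:
+
+```python
+import random
+
+def pick_two_choices(pool, load):
+    # Power of two choices: sample two candidates, keep the less loaded.
+    if len(pool) == 1:
+        return pool[0]
+    a, b = random.sample(pool, 2)
+    return a if load(a) <= load(b) else b
+
+def choose_locations(tservers_per_loc, replicas_per_loc, num_replicas):
+    """Pick a location for each new replica of one tablet.
+
+    tservers_per_loc: dict location -> number of tablet servers there.
+    replicas_per_loc: dict location -> replicas currently hosted there
+        (across all tablets), used to compute the location's load.
+    For simplicity, loads are not updated as replicas are tentatively
+    placed during this call.
+    """
+    load = lambda loc: replicas_per_loc[loc] / tservers_per_loc[loc]
+    placed = {loc: 0 for loc in tservers_per_loc}  # replicas of THIS tablet
+    pool = list(tservers_per_loc)
+    chosen = []
+    for _ in range(num_replicas):
+        if not pool:
+            # All locations used once; refill with locations that still
+            # have a tablet server without a replica of this tablet.
+            pool = [l for l in tservers_per_loc
+                    if placed[l] < tservers_per_loc[l]]
+            if not pool:
+                break  # No tablet server can legally take another replica.
+        loc = pick_two_choices(pool, load)
+        pool.remove(loc)
+        placed[loc] += 1
+        chosen.append(loc)
+    return chosen
+```
+
+Stage 2 would then run the same power of two choices selection among the
+tablet servers within each chosen location.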
+
+#### Re-replication
+
+Re-replication will be a special case of the above algorithm, where we
+are picking a tablet server for a single replica, given that we have
+already picked the locations for the other replicas.
+
+### Location-aware clients
+
+When the client connects to the cluster, it will include the hostname or
+IP of the machine it is running on. The master will assign it a location
+and return that to the client. It will also return location information
+to the client about tablet servers. The client will use this information
+to select a server to read from in READ\_CLOSEST mode. It will prefer a
+local server, as it does now, and then prefer servers in the same
+location (breaking ties randomly). This procedure generalizes to a
+hierarchical scheme: the client would prefer servers sharing the longest
+common prefix of location components in the /-separated string.
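+
+A sketch of this selection order (helper names are hypothetical), with
+the hierarchical generalization expressed via the number of leading
+location components a server shares with the client:
+
+```python
+import random
+
+def common_prefix_len(loc_a, loc_b):
+    # Number of leading /-separated components two locations share, e.g.
+    # "/dc-1/rack-2" and "/dc-1/rack-3" share 1 component.
+    shared = 0
+    for a, b in zip(loc_a.strip("/").split("/"), loc_b.strip("/").split("/")):
+        if a != b:
+            break
+        shared += 1
+    return shared
+
+def pick_server_for_closest_read(replicas, client_host, client_location):
+    """replicas: list of (hostname, location) pairs hosting the tablet."""
+    # 1. Prefer a replica on the same machine as the client.
+    local = [r for r in replicas if r[0] == client_host]
+    if local:
+        return random.choice(local)
+    # 2. Otherwise prefer the replicas sharing the most leading location
+    #    components with the client, breaking ties randomly.
+    best = max(common_prefix_len(client_location, loc) for _, loc in replicas)
+    return random.choice([r for r in replicas
+                          if common_prefix_len(client_location, r[1]) == best])
+```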
+
+### Rebalancing
+
+This design assumes that preserving and enforcing the placement policy
+is higher priority than balancing. Therefore, the rebalancer will
+attempt to conform the cluster with the placement policy and, only
+afterwards, attempt to balance the cluster while maintaining the
+placement policy.
+
+#### Rebalancing: background
+
+Recall that previously a cluster was considered balanced if the maximum
+skew between any pair of tablet servers was 0 or 1. Table balance was
+defined analogously. Currently, the rebalancer uses a two-stage greedy
+algorithm. It is guaranteed to balance every table and the cluster as
+long as replica moves succeed. The thrust of the algorithm is as
+follows:
+
+-   For each table:
+    -   Order the tablet servers by load for the table.
+    -   While the table is not balanced:
+        -   Move a replica from the most loaded server to the least
+            loaded.
+-   All tables are now balanced. Balance the cluster:
+    -   While the cluster is not balanced:
+        -   Move a replica from the most loaded server to the least
+            loaded.
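+
+In code terms, each greedy pass looks roughly like the following sketch
+(the helpers are hypothetical; the real rebalancer operates on live
+cluster state):
+
+```python
+def balance_greedily(load_by_tserver, move_replica):
+    """One greedy pass: load_by_tserver maps tablet server -> number of
+    replicas (of one table, or of the whole cluster); move_replica(src, dst)
+    performs a single replica move."""
+    while True:
+        most = max(load_by_tserver, key=load_by_tserver.get)
+        least = min(load_by_tserver, key=load_by_tserver.get)
+        # Balanced once the maximum pairwise skew is 0 or 1.
+        if load_by_tserver[most] - load_by_tserver[least] <= 1:
+            return
+        move_replica(most, least)
+        load_by_tserver[most] -= 1
+        load_by_tserver[least] += 1
+```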
+
+One key reason this algorithm works is that it is always possible to
+move a replica from the most loaded server to the least loaded. However,
+this is not true when balancing replicas between locations. Consider a
+cluster with 3 locations {A, B, C}. A has 100 tablet servers in it while
+B and C have 2 each. If creation of a tablet with replication factor 5
+results in replica counts {A -\> 2, B -\> 2, C -\> 1}, B has load 1
+while A has load 1/50, but every replica move from B to A would give A a
+majority of the tablet’s replicas and thus violate the placement policy.
+So, it is not always possible to move a replica from the most loaded
+location to the least loaded.
+
+Yet, it is possible to characterize the situations when a move is
+impossible. Suppose now we have two locations A and B. A consists of S
+tablet servers hosting M replicas of a tablet X among them. B consists
+of T tablet servers hosting N replicas of a tablet X among them. Suppose
+the replication factor of the tablet is R. There are two reasons why a
+move of a replica of tablet X from A to B could be illegal:
+
+1.  A tablet server may never host more than one replica of the same
+    tablet. If this principle forbids a move, then N = T. Because of the
+    placement policy, this means that T \<= floor(R / 2), so 2T \<= R.
+    This can only happen if the replication factor is big compared to
+    the number of servers in the smallest location.
+2.  Moving a replica from A to B would violate the placement policy.
+    This can plausibly happen for a single tablet. However, for this to
+    be true for every replica in the table, every tablet would need to
+    already have so many replicas in location B that one more would
+    violate the policy (at least floor(R / 2) of them), so B’s load for
+    the table is already about as high as the placement policy permits.
+    But then we wouldn’t be moving replicas from A to this location if
+    the goal is to reduce the skew. A similar argument applies to moves
+    looking at the total load of the location as well.
+
+\#1 is not expected to be a problem in the common case. Almost all
+tablets are 3x replicated, and 3x replication precludes multiple
+replicas in the same location anyway. Allowing for 5x replication, a
+location would need to have 2 or fewer servers to be vulnerable.
+Therefore, in the common case, it should be possible to move a replica
+from the most loaded location to the least loaded, and a similar greedy
+algorithm will produce good results.
+
+#### Rebalancing stage 1: enforce placement policy
+
+The rebalancer will first attempt to reestablish the placement policy if
+it has been violated. It will find a set of replicas whose movement to
+new locations will bring the tablet into compliance with the placement
+policy. It will mark all of these replicas with the REPLACE attribute,
+and then wait for the master to place them in new locations. If there is
+no such set of replicas then this step is skipped. This is nice and
+simple as it concentrates responsibility for the placement policy in the
+master’s replica selection algorithm. It’s suboptimal since the current
+master replica selection algorithm is not tuned to keep the cluster
+well-balanced in many situations. This is planned to be improved by
+later rebalancer work.
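+
+A minimal sketch of this step for a single tablet, assuming the cluster
+has more than two locations so a violation means some location holds a
+majority of the replicas (the helper and replica names are hypothetical):
+
+```python
+def replicas_to_mark_replace(replicas_by_location, replication_factor):
+    """Pick replicas to mark with the REPLACE attribute so that no
+    location keeps a majority of this tablet's replicas.
+
+    replicas_by_location: dict location -> list of replica ids there.
+    """
+    majority = replication_factor // 2 + 1
+    to_replace = []
+    for location, replicas in replicas_by_location.items():
+        # Shed just enough replicas to drop the location below a majority.
+        excess = len(replicas) - (majority - 1)
+        if excess > 0:
+            to_replace.extend(replicas[:excess])
+    return to_replace
+
+# Example: RF 3 with two replicas in /r1 violates the policy in a
+# three-location cluster; one of them is marked for replacement.
+print(replicas_to_mark_replace({"/r1": ["A", "B"], "/r2": ["C"], "/r3": []}, 3))
+# -> ['A']
+```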
+
+#### Rebalancing stage 2: rebalance
+
+The rebalancer will balance hierarchically: first balancing locations,
+then balancing within locations.
+
+1.  For each table:
+    a.  Order the locations by load for that table.
+    b.  While there are still legal moves:
+        -   Move a replica from the most loaded location to the least
+            loaded.
+            -   Within the location, move from the most loaded tablet
+                server to the least loaded, to reduce the
+                within-location phase’s work.
+    c.  Once legal moves are exhausted, remove the most loaded location
+        from consideration, so the second-most-loaded becomes the most
+        loaded. Return to b.
+2.  Repeat the above steps for replicas of the cluster as a whole.
+3.  Within each location, rebalance using the original rebalancing
+    algorithm, as if the location were a cluster of its own.
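+
+A condensed sketch of the cross-location part of this procedure; the
+helpers, in particular the legality check, are hypothetical stand-ins
+for the real rebalancer logic:
+
+```python
+def rebalance_across_locations(tservers, replicas, find_legal_move, apply_move):
+    """Greedy cross-location pass for one table (or the cluster).
+
+    tservers: dict location -> number of tablet servers.
+    replicas: dict location -> current replica count (updated as we move).
+    find_legal_move(src, dst): a replica movable from src to dst without
+        violating the placement policy, or None if there is none.
+    """
+    load = lambda loc: replicas[loc] / tservers[loc]
+    active = set(tservers)
+    while len(active) >= 2:
+        most = max(active, key=load)
+        least = min(active, key=load)
+        # Stop once a move cannot reduce the skew between the extremes.
+        if (replicas[least] + 1) / tservers[least] >= load(most):
+            break
+        move = find_legal_move(most, least)
+        if move is None:
+            # Legal moves out of the most loaded location are exhausted;
+            # drop it so the second-most-loaded becomes the most loaded.
+            active.discard(most)
+            continue
+        apply_move(move, most, least)
+        replicas[most] -= 1
+        replicas[least] += 1
+```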
+
+I don’t think this algorithm is guaranteed to produce a balanced cluster
+in all cases, but in cases where the replication factor is small
+compared to the size of all of the locations it should produce an
+optimal outcome. In fact, there may exist a better balancing algorithm,
+but it’s not worth spending a lot of time devising, implementing, and
+supporting it as we will almost surely eventually adopt a very different
+stochastic gradient descent sort of algorithm once rebalancing is built
+into the master and repurposed to run constantly in the background.
+Furthermore, I think this algorithm will produce decent results in odd
+cases of high replication factors and small locations, where either the
+placement policy cannot be complied with at all or where \#1 from above
+could limit rebalancing. We may make small tweaks to the algorithm for
+these cases if experiment or experience indicates it is necessary.
+
+### Tools and documentation
+
+#### kudu tool (excluding rebalancing)
+
+The ksck tool will be location-aware: it will list tablet server
+locations and highlight tablets that violate placement policies, at the
+warning level. Additionally, the \`kudu tserver list\` tool will be able
+to list locations.