Posted to commits@helix.apache.org by lx...@apache.org on 2018/06/04 21:40:52 UTC

[1/2] helix git commit: Adding CrushED design document

Repository: helix
Updated Branches:
  refs/heads/master c06c7710d -> ad5109848


Adding CrushED design document


Project: http://git-wip-us.apache.org/repos/asf/helix/repo
Commit: http://git-wip-us.apache.org/repos/asf/helix/commit/1f69e5a0
Tree: http://git-wip-us.apache.org/repos/asf/helix/tree/1f69e5a0
Diff: http://git-wip-us.apache.org/repos/asf/helix/diff/1f69e5a0

Branch: refs/heads/master
Commit: 1f69e5a0be6a35b03d28ab56e1399873bcc3b025
Parents: c06c771
Author: Eric Kim <er...@linkedin.com>
Authored: Fri Jun 1 11:32:03 2018 -0700
Committer: Eric Kim <er...@linkedin.com>
Committed: Fri Jun 1 11:32:03 2018 -0700

----------------------------------------------------------------------
 website/0.6.1-incubating/pom.xml                |   2 +-
 website/0.6.2-incubating/pom.xml                |   2 +-
 website/0.6.3/pom.xml                           |   2 +-
 website/0.6.4/pom.xml                           |   2 +-
 website/0.6.5/pom.xml                           |   2 +-
 website/0.6.6/pom.xml                           |   2 +-
 website/0.6.7/pom.xml                           |   2 +-
 website/0.6.8/pom.xml                           |   2 +-
 website/0.6.9/pom.xml                           |   2 +-
 website/0.7.0-incubating/pom.xml                |   2 +-
 website/0.7.1/pom.xml                           |   2 +-
 website/0.8.0/pom.xml                           |   2 +-
 website/0.8.1/pom.xml                           |   2 +-
 .../0.8.1/src/site/markdown/design_crushed.md   | 191 +++++++++++
 website/0.8.1/src/site/markdown/index.md        |   4 +
 .../design/crushed/after-using-crushed.png      | Bin 0 -> 9935 bytes
 .../design/crushed/before-using-crush.png       | Bin 0 -> 10286 bytes
 .../resources/images/design/crushed/classes.png | Bin 0 -> 14742 bytes
 .../design/crushed/crushed-master-dist.png      | Bin 0 -> 18725 bytes
 .../design/crushed/crushed-partition-dist.png   | Bin 0 -> 16630 bytes
 .../images/design/crushed/cursh-master-dist.png | Bin 0 -> 18036 bytes
 .../design/crushed/cursh-partition-dist.png     | Bin 0 -> 16391 bytes
 .../example-cluster-master-dist-after.png       | Bin 0 -> 12899 bytes
 .../crushed/example-cluster-master-dist.png     | Bin 0 -> 13601 bytes
 .../crushed/example-cluster-partition-dist.png  | Bin 0 -> 13472 bytes
 .../crushed/example-movement-on-expansion.png   | Bin 0 -> 31833 bytes
 .../design/crushed/node-down-master-move.png    | Bin 0 -> 16897 bytes
 .../design/crushed/node-down-partition-move.png | Bin 0 -> 15982 bytes
 .../images/design/crushed/performance.png       | Bin 0 -> 17532 bytes
 website/pom.xml                                 |   2 +-
 website/src/site/markdown/design/crush-ed.md    | 336 +++++++++++++++++++
 31 files changed, 545 insertions(+), 14 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.6.1-incubating/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.6.1-incubating/pom.xml b/website/0.6.1-incubating/pom.xml
index 2321886..053e632 100644
--- a/website/0.6.1-incubating/pom.xml
+++ b/website/0.6.1-incubating/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.6.1-incubating-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.6.2-incubating/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.6.2-incubating/pom.xml b/website/0.6.2-incubating/pom.xml
index 62f8d36..b38adc5 100644
--- a/website/0.6.2-incubating/pom.xml
+++ b/website/0.6.2-incubating/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.6.2-incubating-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.6.3/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.6.3/pom.xml b/website/0.6.3/pom.xml
index 5972a36..8fcf016 100644
--- a/website/0.6.3/pom.xml
+++ b/website/0.6.3/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.6.3-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.6.4/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.6.4/pom.xml b/website/0.6.4/pom.xml
index 2702432..62fd34a 100644
--- a/website/0.6.4/pom.xml
+++ b/website/0.6.4/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.6.4-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.6.5/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.6.5/pom.xml b/website/0.6.5/pom.xml
index fde2e00..7ff15fb 100644
--- a/website/0.6.5/pom.xml
+++ b/website/0.6.5/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.6.5-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.6.6/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.6.6/pom.xml b/website/0.6.6/pom.xml
index c7aa9ce..623216d 100644
--- a/website/0.6.6/pom.xml
+++ b/website/0.6.6/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.6.6-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.6.7/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.6.7/pom.xml b/website/0.6.7/pom.xml
index edda5dc..fb402b2 100644
--- a/website/0.6.7/pom.xml
+++ b/website/0.6.7/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.6.7-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.6.8/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.6.8/pom.xml b/website/0.6.8/pom.xml
index e30ae4e..8faaef7 100644
--- a/website/0.6.8/pom.xml
+++ b/website/0.6.8/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.6.8-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.6.9/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.6.9/pom.xml b/website/0.6.9/pom.xml
index ed33598..a6070eb 100644
--- a/website/0.6.9/pom.xml
+++ b/website/0.6.9/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.6.9-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.7.0-incubating/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.7.0-incubating/pom.xml b/website/0.7.0-incubating/pom.xml
index f38abdd..125e89a 100644
--- a/website/0.7.0-incubating/pom.xml
+++ b/website/0.7.0-incubating/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.7.0-incubating-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.7.1/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.7.1/pom.xml b/website/0.7.1/pom.xml
index 7f71811..ae6d35d 100644
--- a/website/0.7.1/pom.xml
+++ b/website/0.7.1/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.7.1-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.0/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.8.0/pom.xml b/website/0.8.0/pom.xml
index 448c6da..b0c3ad3 100644
--- a/website/0.8.0/pom.xml
+++ b/website/0.8.0/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.8.0-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/pom.xml
----------------------------------------------------------------------
diff --git a/website/0.8.1/pom.xml b/website/0.8.1/pom.xml
index 0667e26..dde6935 100644
--- a/website/0.8.1/pom.xml
+++ b/website/0.8.1/pom.xml
@@ -23,7 +23,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>website</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
 
   <artifactId>0.8.1-docs</artifactId>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/markdown/design_crushed.md
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/markdown/design_crushed.md b/website/0.8.1/src/site/markdown/design_crushed.md
new file mode 100644
index 0000000..7fd1773
--- /dev/null
+++ b/website/0.8.1/src/site/markdown/design_crushed.md
@@ -0,0 +1,191 @@
+CrushED (Crush-based rebalancer with Even Distribution)
+============================================
+
+## Overview
+
+Helix provides AutoRebalanceStrategy, which is based on a card-dealing strategy. This strategy takes the current mapping as an input and computes new mappings only for the partitions that need to be moved. This minimizes partition movement, but the mapping is not deterministic, and fault-zone aware mapping (i.e. rack-aware partitioning) is not possible.
+
+The CRUSH-based partitioning scheme was implemented to provide fault-zone aware mapping and deterministic partition assignment. CrushRebalanceStrategy (and MultiRoundCrushRebalanceStrategy) uses pseudo-random partition placement to ensure consistent partition distribution. As the number of placed items (i.e. partitions) approaches infinity, the distribution becomes perfectly uniform. However, with a small number of placed items, especially for resources (i.e. databases) with few partitions, the placement algorithm may produce a fairly uneven partition distribution.
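
To make the small-sample effect concrete, here is a toy, self-contained placement sketch (a rendezvous-style stand-in for CRUSH's pseudo-random selection; `place` and the FNV-1a hash are illustrative assumptions, not Helix code). With a handful of partitions the per-node counts can be visibly skewed, while a large partition count approaches uniformity.

```java
import java.util.*;

public class SmallSamplePlacement {
    // Toy pseudo-random placement: each partition lands on the node with the
    // highest hash(partition + node) score. Deterministic, like CRUSH, but
    // only statistically uniform.
    static Map<String, Integer> place(int partitionCount, List<String> nodes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String n : nodes) counts.put(n, 0);
        for (int p = 0; p < partitionCount; p++) {
            String best = nodes.get(0);
            long bestScore = Long.MIN_VALUE;
            for (String n : nodes) {
                long score = hash("p" + p + "@" + n);
                if (score > bestScore) { bestScore = score; best = n; }
            }
            counts.merge(best, 1, Integer::sum);
        }
        return counts;
    }

    // FNV-1a, a simple deterministic stand-in for the real hash function.
    static long hash(String s) {
        long h = 0xcbf29ce484222325L;
        for (char c : s.toCharArray()) { h ^= c; h *= 0x100000001b3L; }
        return h;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("n1", "n2", "n3", "n4");
        // Few partitions: per-node counts are typically skewed.
        System.out.println(place(8, nodes));
        // Many partitions: counts approach 25% per node.
        System.out.println(place(100000, nodes));
    }
}
```

Running `main` prints the per-node counts for 8 partitions and for 100,000 partitions; the former is typically skewed while the latter is close to uniform, which is the behavior motivating CrushED.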
+
+We want to provide a new rebalance strategy that is deterministic and fault-zone aware while providing even partition distribution in all cases. In this document, we propose a hybrid algorithm that combines CRUSH, a card-dealing strategy, and consistent hashing to ensure both even distribution and minimal partition movement (while the cluster topology remains the same). We call it CrushED (CRUSH with Even Distribution). Compared to CRUSH, CrushED produces a much more uniform distribution and minimal partition movement as long as the topology remains the same, at the cost of additional runtime computation.
+
+## Design
+
+In addition to what we already achieved in CrushRebalanceStrategy, we have two high-level goals:
+
+1.  Even distribution.
+2.  Minimize partition movements when instances go up/down.
+
+CrushRebalanceStrategy produces very few movements, but the distribution is not optimal. MultiRoundCrushRebalanceStrategy was designed to solve this problem by running CRUSH multiple times on the partition assignments that contribute to uneven mapping. However, due to the potentially high number of rounds, computation cost is high, and we observed significantly more partition movements when the cluster topology changed.
+
+Since we have a good base strategy, CrushRebalanceStrategy, we built CrushEDRebalanceStrategy on top of it. Sample mappings of both strategies are shown below. Note that the blue parts remain unchanged before and after.
+
+Before (CRUSH)
+
+![Before (CRUSH)](images/design/crushed/before-using-crush.png)
+
+After (new strategy)
+
+![After (new strategy)](images/design/crushed/after-using-crushed.png)
+
+Since the problem is NP-hard, we do not expect the optimal assignment; a greedy algorithm works well enough.  
+After trying different designs, we found it hard to achieve both goals (even distribution and few movements) with a single strategy, so we apply a hybrid algorithm that finishes the work step by step.
+
+**Step 1, run CRUSH to get a base assignment.**  
+The base assignment usually contains a certain number of uneven partitions, so we need the following steps to re-distribute them.
+
+**Step 2, run a card-dealing algorithm on the uneven parts and assign them to idle nodes.**  
+This algorithm is conceptually simple, and the result ensures that partition counts across instances differ as little as possible. Note that when fault-zone constraints come into play, the greedy algorithm may fail to find a valid assignment because a candidate placement can conflict with fault-zone requirements, so we add a buffer that tolerates a small amount of unevenness.
+
+Example of assignments after step 2,
+
+![Example](images/design/crushed/example-cluster-partition-dist.png)
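
As a rough sketch of the card-dealing idea (the `dealExcess` helper and data shapes are hypothetical, not Helix's actual implementation, and fault-zone constraints are ignored here), each excess partition is simply dealt to the currently least-loaded node:

```java
import java.util.*;

public class CardDealingSketch {
    // Deal each excess partition to the currently least-loaded node so that
    // final per-node partition counts differ by at most one.
    static void dealExcess(List<String> nodes,
                           Map<String, List<String>> assignment,
                           List<String> excess) {
        PriorityQueue<String> byLoad = new PriorityQueue<>(
            Comparator.comparingInt((String n) -> assignment.get(n).size()));
        byLoad.addAll(nodes);
        for (String partition : excess) {
            String target = byLoad.poll();   // least-loaded node right now
            assignment.get(target).add(partition);
            byLoad.add(target);              // re-insert with its new load
        }
    }

    // Fixed example: starting loads 3/1/0 plus three excess partitions.
    static List<Integer> demoLoads() {
        Map<String, List<String>> assignment = new HashMap<>();
        assignment.put("n1", new ArrayList<>(Arrays.asList("p0", "p1", "p2")));
        assignment.put("n2", new ArrayList<>(Arrays.asList("p3")));
        assignment.put("n3", new ArrayList<>());
        dealExcess(Arrays.asList("n1", "n2", "n3"), assignment,
                   Arrays.asList("p4", "p5", "p6"));
        List<Integer> loads = new ArrayList<>();
        for (List<String> l : assignment.values()) loads.add(l.size());
        Collections.sort(loads);
        return loads;   // sorted per-node partition counts
    }

    public static void main(String[] args) {
        System.out.println(demoLoads());
    }
}
```

`demoLoads()` starts from loads 3/1/0 with three unassigned partitions; dealing brings the counts to 3/2/2, i.e. a spread of at most one.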
+
+**Step 3, shuffle partitions' preference lists.**  
+Since replica states are assigned according to node order in these lists, randomly ordered lists would make state assignment (i.e. Master, Slave, Online, Offline) random as well, which may result in uneven state distribution. To resolve this issue, CrushED assigns scores to nodes as it computes preference lists, giving all nodes an equal chance of appearing at the top of a list. This results in a much more even state distribution.
+
+Example of master distribution before step 3,
+
+![Example](images/design/crushed/example-cluster-master-dist.png)
+
+Example of master distribution after step 3,
+
+![Example](images/design/crushed/example-cluster-master-dist-after.png)
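
A minimal sketch of the Step 3 scoring idea (hypothetical names, simplified data model; not Helix's actual implementation): track how many times each node has already appeared at the head of a preference list, and sort each list by that count so the head slot rotates across nodes:

```java
import java.util.*;

public class PreferenceShuffleSketch {
    // Reorder each partition's replica list so nodes with the fewest
    // head-of-list appearances so far come first. Since the top of the
    // preference list receives the top state (e.g. Master), this evens
    // out state distribution across nodes.
    static Map<String, List<String>> orderPreferenceLists(
            Map<String, List<String>> partitionToReplicas) {
        Map<String, Integer> headCount = new HashMap<>();
        Map<String, List<String>> ordered = new TreeMap<>();
        for (Map.Entry<String, List<String>> e :
                 new TreeMap<>(partitionToReplicas).entrySet()) {
            List<String> replicas = new ArrayList<>(e.getValue());
            // Stable sort: ties keep the original (CRUSH-computed) order.
            replicas.sort(Comparator.comparingInt(
                (String n) -> headCount.getOrDefault(n, 0)));
            headCount.merge(replicas.get(0), 1, Integer::sum);
            ordered.put(e.getKey(), replicas);
        }
        return ordered;
    }

    // Fixed example: four partitions all hosted on [n1, n2]; without scoring,
    // n1 would head every list. With scoring, each node heads two lists.
    static Map<String, Integer> demoHeadCounts() {
        Map<String, List<String>> prefs = new HashMap<>();
        for (String p : Arrays.asList("p0", "p1", "p2", "p3")) {
            prefs.put(p, Arrays.asList("n1", "n2"));
        }
        Map<String, Integer> heads = new HashMap<>();
        for (List<String> l : orderPreferenceLists(prefs).values()) {
            heads.merge(l.get(0), 1, Integer::sum);
        }
        return heads;
    }

    public static void main(String[] args) {
        System.out.println(demoHeadCounts());
    }
}
```

Without the scoring pass, a fixed node order would make the same node Master for every partition it hosts; with it, mastership alternates.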
+
+**Step 4, re-calculate the assignment for the partitions on temporarily disabled nodes using a consistent hashing algorithm.**  
+Consistent hashing ensures minimal partition movement.  
+Note that the first three steps use the full node list, regardless of disabled or offline nodes, so the assignment remains stable even though the algorithm contains random factors such as hashCode. Step 4 then ensures all the disabled nodes are handled correctly without causing large partition movements.
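
A minimal consistent-hashing sketch for Step 4 (hypothetical helpers such as `remapDeadOwners`; Helix's real implementation differs): live nodes are placed on a hash ring with virtual nodes, and only partitions whose owner is not live are re-mapped, which is what bounds the movement:

```java
import java.util.*;

public class ConsistentHashingSketch {
    // Build a hash ring from live nodes, using virtual nodes to smooth the
    // distribution. A key maps to the first node clockwise from its hash.
    static TreeMap<Long, String> buildRing(Collection<String> liveNodes, int virtualNodes) {
        TreeMap<Long, String> ring = new TreeMap<>();
        for (String node : liveNodes)
            for (int i = 0; i < virtualNodes; i++)
                ring.put(hash(node + "@" + i), node);
        return ring;
    }

    static String locate(TreeMap<Long, String> ring, String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    // Only partitions whose owner is not live get re-mapped; everything else
    // stays put, which is what keeps movement small.
    static Map<String, String> remapDeadOwners(Map<String, String> partitionToOwner,
                                               Set<String> liveNodes) {
        TreeMap<Long, String> ring = buildRing(liveNodes, 16);
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : partitionToOwner.entrySet()) {
            String owner = e.getValue();
            result.put(e.getKey(),
                       liveNodes.contains(owner) ? owner : locate(ring, e.getKey()));
        }
        return result;
    }

    // FNV-1a, a deterministic stand-in for the real hash function.
    static long hash(String s) {
        long h = 0xcbf29ce484222325L;
        for (char c : s.toCharArray()) { h ^= c; h *= 0x100000001b3L; }
        return h;
    }

    // p0's owner n1 is live and must not move; p1's owner n2 is down and
    // must land on one of the live nodes.
    static boolean demo() {
        Map<String, String> owners = new HashMap<>();
        owners.put("p0", "n1");
        owners.put("p1", "n2");
        Set<String> live = new HashSet<>(Arrays.asList("n1", "n3"));
        Map<String, String> out = remapDeadOwners(owners, live);
        return out.get("p0").equals("n1") && live.contains(out.get("p1"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The key property is that partitions on live nodes never move, and when the dead node returns, only its own partitions move back.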
+
+One potential issue with this heuristic algorithm is that it may not converge; in that case, CrushED falls back to plain CRUSH.  
+Pseudocode is listed below.
+
+**Pseudo Code** 
+
+    // Round 1: Calculate mapping using the base strategy.
+    // Note to use all nodes for minimizing the influence of live node changes.
+    origPartitionMap = getBaseRebalanceStrategy().computePartitionAssignment(allNodes, clusterData);
+    
+    // Transform current assignment to instance->partitions map, and get total partitions
+    nodeToPartitionMap = convertMap(origPartitionMap);
+
+    // Round 2: Rebalance mapping using card dealing algorithm.
+    Topology allNodeTopo = new Topology(allNodes, clusterData);
+    cardDealer.computeMapping(allNodeTopo, nodeToPartitionMap);
+
+    // Since states are assigned according to preference list order, shuffle preference list for even states distribution.
+    shufflePreferenceList(nodeToPartitionMap);
+
+    // Round 3: Re-mapping the partitions on non-live nodes using consistent hashing for reducing movement.
+    // Consistent hashing ensures minimum movements when nodes are disabled unexpectedly.
+    if (!liveNodes.containsAll(allNodes)) {
+      Topology liveNodeTopo = new Topology(liveNodes, clusterData);
+      hashPlacement.computeMapping(liveNodeTopo, nodeToPartitionMap);
+    }
+
+    if (!nodeToPartitionMap.isEmpty()) {
+      // Rounds 2 and 3 completed successfully
+      return convertMap(nodeToPartitionMap);
+    } else {
+      return getBaseRebalanceStrategy().computePartitionAssignment(liveNodes, clusterData);
+    }
+
+
+### Maximum uneven partition assignment using CrushED
+
+A Helix cluster typically manages one or more resources (i.e. databases). For each resource, CrushED makes a best effort to keep the partition count difference across instances at most 1. Assuming such an assignment is possible given the fault-zone configuration, the worst distribution happens when every resource places its one extra partition on the same node. N resources can therefore each contribute one extra partition to that node, giving it N additional partitions in total. Thus, the maximum difference between the most and least loaded nodes is **the number of resources** in the cluster.
+
+## Experiment
+
+We tested CrushED by simulating real production cluster topologies, covering multiple scenarios:
+
+*   Distribution based on cluster topology.
+*   Disabling hosts to simulate hosts down.
+*   Adding hosts to simulate expansion.
+*   Rolling upgrade.
+
+All results show that CrushED generates a more uniform global distribution than CRUSH.  
+Moreover, partition movements in most scenarios are minimized. When the topology changes (i.e. cluster expansion), there can be significantly more partition movements, but the impact can be controlled using the State Transition Throttling feature.
+
+### Partition Distribution
+
+The following charts show the worst cases (min load vs. max load) and standard deviations of partition/master distributions from some sample cluster data.  
+Measured by standard deviation, CrushED improves partition distribution evenness by 87% on average compared with CRUSH, and state assignment (i.e. mastership) evenness by 68% on average.
+
+![Example](images/design/crushed/cursh-partition-dist.png)![Example](images/design/crushed/crushed-partition-dist.png)
+
+![Example](images/design/crushed/cursh-master-dist.png)![Example](images/design/crushed/crushed-master-dist.png)
+
+### Disabling Nodes
+
+When nodes are offline or disabled, CrushED re-assigns their partitions to other live nodes, moving only the necessary partitions.  
+We simulated disabling nodes and measured partition movements and mastership changes, using the expected movement (the partition/master count on the disabled nodes) as a baseline for measuring extra movements.
+
+The results show that movement is highly correlated with the number of disabled nodes, and extra movements are minor (0 in most cases).
+
+Note that **Rate** in this document means **the number of changes divided by the total partition or master count**.
+
+![Example](images/design/crushed/node-down-partition-move.png)![Example](images/design/crushed/node-down-master-move.png)
+
+### Rolling upgrade
+
+Rolling upgrade differs from disabling nodes: since nodes are reset one by one, in this test we assume at most 2 nodes differ at any time (for example, upgrading Node A and then Node B).  
+In this case, movements are still minimized. Even in the worst-case scenario, extra partition movements and mastership changes remain close to 0%.
+
+Note that in real production clusters, partition movements during a rolling upgrade can be avoided entirely by enabling Delayed Rebalancing.
+
+### Adding Nodes
+
+Adding nodes (i.e. cluster expansion) changes the topology. Since CrushED uses the card-dealing strategy to provide even distribution, a topology change causes many more partition movements than with CRUSH.
+
+Note that the extra change rate is not correlated with the number of added nodes, so we recommend finishing expansion in a single operation, shuffling partitions only once.
+
+### Algorithm Performance
+
+We compared CrushED with CRUSH using different numbers of instances. The tests were executed multiple times and we recorded the median computation time.  
+CrushED does not add much computation time for regular rebalancing; in the worst cases we observed about 30% more runtime than CRUSH, but it is still faster than MultiRoundCRUSH.
+
+However, when nodes are down, CrushED needs to run an additional consistent-hashing-based re-distribution, so computation takes much longer; in some cases we saw more than 3 times the CRUSH runtime.
+
+With some **performance improvements**, such as caching to avoid duplicate calculation, we greatly reduced CrushED's running time. According to our experiments, it is now close to MultiRoundCRUSH.
+
+![Example](images/design/crushed/performance.png)
+
+## Conclusion
+
+CrushED achieves a more uniform distribution than CRUSH at the cost of higher rebalance computation and more partition movement when the cluster topology changes.
+
+## Simple User Guide
+
+1.  Ensure the resource is using the FULL_AUTO rebalance mode.
+2.  Set the rebalance strategy to "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy".
+3.  Expect more partition movement on topology changes when using the new strategy.
+
+**IdealState SimpleFields Example**
+
+    HELIX_ENABLED : "true"
+    IDEAL_STATE_MODE : "AUTO_REBALANCE"
+    REBALANCE_MODE : "FULL_AUTO"
+    REBALANCE_STRATEGY : "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy"
+    MIN_ACTIVE_REPLICAS : "0"
+    NUM_PARTITIONS : "64"
+    REBALANCER_CLASS_NAME : "org.apache.helix.controller.rebalancer.DelayedAutoRebalancer"
+    REPLICAS : "1"
+    STATE_MODEL_DEF_REF : "LeaderStandby"
+
+## Future Work
+
+**Instance Level Capacity Limitation**
+
+Currently, all resources are assigned separately.  
+The advantage of this design is that changes to one resource do not cause partitions of other resources to be re-assigned.  
+The disadvantages are:
+
+1.  It is hard to ensure strictly uniform global distribution.
+2.  Instance-level capacity control is not possible, since the algorithm has no global view of partition assignment.
+
+**Rebalance Algorithm Takes Partition Weight into Consideration**
+
+This algorithm still treats all partitions as equally weighted, but in reality different partitions may have different resource requirements.  
+Application admins need to configure partition weights, and Helix should assign partitions accordingly.
+
+Note that this feature only makes sense for a global assignment algorithm, since all partitions within a single resource are weighted the same.

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/markdown/index.md
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/markdown/index.md b/website/0.8.1/src/site/markdown/index.md
index c736316..9a67ef6 100644
--- a/website/0.8.1/src/site/markdown/index.md
+++ b/website/0.8.1/src/site/markdown/index.md
@@ -52,3 +52,7 @@ under the License.
 ### Operation
 
 [Monitoring Metrics](./Metrics.html)
+
+### Design
+
+[CRUSH-ed for even distribution](./design_crushed.html)

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/after-using-crushed.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/after-using-crushed.png b/website/0.8.1/src/site/resources/images/design/crushed/after-using-crushed.png
new file mode 100644
index 0000000..f6df13e
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/after-using-crushed.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/before-using-crush.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/before-using-crush.png b/website/0.8.1/src/site/resources/images/design/crushed/before-using-crush.png
new file mode 100644
index 0000000..ca6ab63
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/before-using-crush.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/classes.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/classes.png b/website/0.8.1/src/site/resources/images/design/crushed/classes.png
new file mode 100644
index 0000000..59ac3e7
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/classes.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/crushed-master-dist.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/crushed-master-dist.png b/website/0.8.1/src/site/resources/images/design/crushed/crushed-master-dist.png
new file mode 100644
index 0000000..0053c76
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/crushed-master-dist.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/crushed-partition-dist.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/crushed-partition-dist.png b/website/0.8.1/src/site/resources/images/design/crushed/crushed-partition-dist.png
new file mode 100644
index 0000000..31e76ab
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/crushed-partition-dist.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/cursh-master-dist.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/cursh-master-dist.png b/website/0.8.1/src/site/resources/images/design/crushed/cursh-master-dist.png
new file mode 100644
index 0000000..8a37146
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/cursh-master-dist.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/cursh-partition-dist.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/cursh-partition-dist.png b/website/0.8.1/src/site/resources/images/design/crushed/cursh-partition-dist.png
new file mode 100644
index 0000000..67eeb2d
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/cursh-partition-dist.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-master-dist-after.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-master-dist-after.png b/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-master-dist-after.png
new file mode 100644
index 0000000..0364e5d
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-master-dist-after.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-master-dist.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-master-dist.png b/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-master-dist.png
new file mode 100644
index 0000000..fb8c624
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-master-dist.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-partition-dist.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-partition-dist.png b/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-partition-dist.png
new file mode 100644
index 0000000..06d7938
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/example-cluster-partition-dist.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/example-movement-on-expansion.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/example-movement-on-expansion.png b/website/0.8.1/src/site/resources/images/design/crushed/example-movement-on-expansion.png
new file mode 100644
index 0000000..0c594f8
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/example-movement-on-expansion.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/node-down-master-move.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/node-down-master-move.png b/website/0.8.1/src/site/resources/images/design/crushed/node-down-master-move.png
new file mode 100644
index 0000000..d864001
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/node-down-master-move.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/node-down-partition-move.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/node-down-partition-move.png b/website/0.8.1/src/site/resources/images/design/crushed/node-down-partition-move.png
new file mode 100644
index 0000000..e5b22bd
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/node-down-partition-move.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/0.8.1/src/site/resources/images/design/crushed/performance.png
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/resources/images/design/crushed/performance.png b/website/0.8.1/src/site/resources/images/design/crushed/performance.png
new file mode 100644
index 0000000..3e07f53
Binary files /dev/null and b/website/0.8.1/src/site/resources/images/design/crushed/performance.png differ

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/pom.xml
----------------------------------------------------------------------
diff --git a/website/pom.xml b/website/pom.xml
index 16090a7..dba3ce9 100644
--- a/website/pom.xml
+++ b/website/pom.xml
@@ -21,7 +21,7 @@ under the License.
   <parent>
     <groupId>org.apache.helix</groupId>
     <artifactId>helix</artifactId>
-    <version>0.8.1-SNAPSHOT</version>
+    <version>0.8.2-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>

http://git-wip-us.apache.org/repos/asf/helix/blob/1f69e5a0/website/src/site/markdown/design/crush-ed.md
----------------------------------------------------------------------
diff --git a/website/src/site/markdown/design/crush-ed.md b/website/src/site/markdown/design/crush-ed.md
new file mode 100644
index 0000000..f4eaa24
--- /dev/null
+++ b/website/src/site/markdown/design/crush-ed.md
@@ -0,0 +1,336 @@
+CRUSH-ed for even distribution
+============================================
+
+*   [Overview](#CRUSH-edforevendistribution-Overview)
+*   [Design](#CRUSH-edforevendistribution-Design)
+*   [Experiment](#CRUSH-edforevendistribution-Experiment)
+    *   [Partition Distribution](#CRUSH-edforevendistribution-PartitionDistribution)
+    *   [Disabling Nodes](#CRUSH-edforevendistribution-DisablingNodes)
+    *   [Rolling upgrade](#CRUSH-edforevendistribution-Rollingupgrade)
+    *   [Adding Nodes](#CRUSH-edforevendistribution-AddingNodes)
+    *   [Algorithm Performance](#CRUSH-edforevendistribution-AlgorithmPerformance)
+    *   [Additional Tests for Corner Cases](#CRUSH-edforevendistribution-AdditionalTestsforCornerCases)
+*   [Conclusion](#CRUSH-edforevendistribution-Conclusion)
+*   [User Guide](#CRUSH-edforevendistribution-UserGuide)
+*   [Future Works](#CRUSH-edforevendistribution-FutureWorks)
+    *   [Instance Level Capacity Limitation](#CRUSH-edforevendistribution-InstanceLevelCapacityLimitation)
+    *   [Rebalance Algorithm Takes Partition Weight into Consideration](#CRUSH-edforevendistribution-RebalanceAlgorithmTakesPartitionWeightintoConsideration)
+
+Overview
+========
+
+The CRUSH (or MultiRoundCRUSH) algorithm ensures consistent partition distribution. It is based on pseudo-random partition placement. As the number of placed items (in our case, partitions) approaches infinity, the distribution approaches perfectly uniform.  
+However, it behaves worst with a small number of placed items. In particular, for single-tenant clusters with a small number of partitions, we may see uneven partition distributions across the instances in the cluster.  
+According to data collected from our PROD environment, the distribution calculated by CRUSH is not satisfactory: there is about a 50% load difference between the most and least loaded nodes. Even with MultiRoundCRUSH, the difference is still about 30%.  
+
+Alternatively, the card dealing strategy generates a more even distribution in these cases. However, compared with CRUSH, it causes many more extra partition movements (a movement is "extra" if the host instance has one partition moved out and another moved in) when the cluster topology changes. The legacy AutoRebalanceStrategy, which is also based on card dealing, resolves this issue by taking the current mapping as an input and changing the mapping only for the delta. This means the algorithm is not deterministic. Worse, it does not support fault zone configuration. So the legacy AutoRebalanceStrategy is not an option.
+
+Given that a deterministic algorithm is required, we find that even load distribution and minimal partition movement are hard to achieve at the same time. A trade-off between uniform distribution and partition movement during cluster changes is needed.
+
+In this document, we propose a hybrid algorithm based on CRUSH that ensures both even distribution and minimized extra partition movement. We call it CRUSH-ed.  
+The basic idea is to run additional rounds of rebalancing on the uneven partitions. Note that CRUSH-ed is stable (deterministic) and fault zone aware.
+
+According to our experiments, CRUSH-ed results in a much more uniform distribution and very few extra partition movements (when nodes change unexpectedly) compared with the original CRUSH. The cost is additional runtime for the re-assignment calculation.  
+We believe CRUSH-ed can achieve a better partition assignment in most cases.
+
+Design
+======
+
+In general, we have 2 goals:
+
+1.  Even distribution.
+2.  Minimize partition movements when nodes are changed.
+
+CRUSH produces very few partition movements, but its distribution is not optimal.
+
+MultiRoundCRUSH, in turn, is designed for even distribution. The idea is to run CRUSH multiple times, each time applying it only to the overloaded part, so the distribution eventually converges to an even one.  
+However, the number of iterations required is not guaranteed, and there are more partition movements when the node topology changes. As a result, it more or less fails both goals and is still not ideal.
+
+Since we already have a good base, we built CRUSH-ed on top of CRUSH. It adjusts the uneven partition assignment generated by CRUSH as follows.  
+Note that the blue part stays unchanged before and after.
+
+Before (CRUSH)
+
+![Before (CRUSH)](images/design/crushed/214844314.png)
+
+After (new strategy)
+
+![After (new strategy)](images/design/crushed/214844313.png)
+
+Since the problem is NP-hard, we are not expecting the optimal assignment; a greedy algorithm works well enough.  
+After trying different designs, we found it hard to achieve both goals (even distribution and fewer movements) with a single strategy. So we decided to apply a hybrid algorithm that finishes the work step by step.
+
+**Step 1, run CRUSH to get a base assignment.**  
+The base assignment usually contains a certain number of uneven partitions (i.e. extra partitions above the perfect distribution), so we need the following steps to re-distribute them.
+
+**Step 2, run a card dealing algorithm on the uneven parts.**  
+Assign the extra partitions to under-loaded nodes using a card dealing strategy. This algorithm is conceptually simple, and the result ensures that partitions are assigned to instances with minimal difference. Note that when fault zones are considered, our greedy algorithm may not be able to find a valid result because a candidate assignment may have a fault zone conflict. So we add a buffer to tolerate a small amount of unevenness.
+
+Example of assignments after step 2,
+
+![Example](images/design/crushed/214844288.png)
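
The card dealing step can be sketched as follows. This is an illustrative, simplified sketch, not the Helix implementation: it ignores fault zones and the buffer, and the node names and load numbers are hypothetical. Partitions above the per-node ceiling are stripped from overloaded nodes and dealt, one at a time, to the currently least-loaded node.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class CardDealingSketch {

    // Redistribute partition counts so that node loads differ by at most one.
    public static Map<String, Integer> rebalance(Map<String, Integer> loadPerNode) {
        int total = loadPerNode.values().stream().mapToInt(Integer::intValue).sum();
        int ceiling = (total + loadPerNode.size() - 1) / loadPerNode.size();
        Map<String, Integer> result = new LinkedHashMap<>(loadPerNode);

        int extras = 0;
        for (Map.Entry<String, Integer> e : result.entrySet()) {
            if (e.getValue() > ceiling) {
                extras += e.getValue() - ceiling;  // strip the overload
                e.setValue(ceiling);
            }
        }
        while (extras-- > 0) {
            // Deal one stripped partition to the least-loaded node.
            String target = Collections.min(result.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            result.put(target, result.get(target) + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical uneven base assignment from CRUSH: 6 + 2 + 2 + 2 = 12 partitions.
        Map<String, Integer> load = new LinkedHashMap<>();
        load.put("n1", 6);
        load.put("n2", 2);
        load.put("n3", 2);
        load.put("n4", 2);
        Map<String, Integer> balanced = rebalance(load);
        int diff = Collections.max(balanced.values()) - Collections.min(balanced.values());
        if (diff > 1) {
            throw new AssertionError("load difference should be at most one, was " + diff);
        }
        System.out.println(balanced); // {n1=3, n2=3, n3=3, n4=3}
    }
}
```

The real algorithm deals whole partitions (not just counts) and rejects candidate nodes that would create a fault zone conflict.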
+
+**Step 3, Shuffle partitions' preference lists.**  
+Since replica states are assigned according to the node order in these lists, randomly ordered lists lead to random state assignment, which may cause an uneven state distribution.  
+To resolve this issue, before the final step 4, CRUSH-ed runs a simpler version of the card dealing algorithm on the replica order to shuffle each preference list. This operation results in a much more even state distribution.
+
+Example of master distribution before step 3,
+
+![Example](images/design/crushed/214844287.png)
+
+Example of master distribution after step 3,
+
+![Example](images/design/crushed/214844286.png)
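
The preference-list shuffle can be sketched as below. This is a simplified illustration under the assumption that the top node of each list hosts the Master replica; the partition and node names are hypothetical. Nodes with fewer head-of-list appearances so far are moved to the front, and a stable sort keeps the original CRUSH order among ties, which limits movement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PreferenceListShuffleSketch {

    // Reorder each preference list so nodes take turns at the head of the list.
    public static Map<String, List<String>> shuffle(Map<String, List<String>> prefLists) {
        Map<String, Integer> headCount = new HashMap<>();
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : prefLists.entrySet()) {
            List<String> list = new ArrayList<>(e.getValue());
            // Stable sort by how often each node has already led a list.
            list.sort(Comparator.comparingInt(n -> headCount.getOrDefault(n, 0)));
            headCount.merge(list.get(0), 1, Integer::sum);
            result.put(e.getKey(), list);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical worst case: node "a" is first in every list,
        // so it would host every Master replica.
        Map<String, List<String>> prefLists = new LinkedHashMap<>();
        for (int i = 0; i < 6; i++) {
            prefLists.put("p" + i, Arrays.asList("a", "b", "c"));
        }
        Map<String, Integer> heads = new HashMap<>();
        for (List<String> list : shuffle(prefLists).values()) {
            heads.merge(list.get(0), 1, Integer::sum);
        }
        System.out.println(heads); // each node now leads 2 of the 6 lists
    }
}
```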
+
+**Step 4, re-calculate the assignment for the partitions on temporarily disabled nodes using a consistent hashing algorithm.**  
+Consistent hashing ensures minimal partition movement.  
+Note that the first 3 steps use the full node list, regardless of disabled or offline nodes, so the assignment is stable even though the algorithm contains random factors such as hashCode. Step 4 then ensures all the disabled nodes are handled correctly without causing huge partition movements.
+
+Assume the first node is down. After step 4, the final assignment and master distribution are as follows. Note that there is no assignment on the first node anymore.
+
+![Example](images/design/crushed/214844285.png)
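
The key property of consistent hashing used here, that removing a node only moves the partitions that were on that node, can be sketched with a standard hash ring. This is a minimal illustration with hypothetical node and partition names, using virtual nodes to smooth the distribution; it is not the Helix implementation.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashingSketch {

    private static final int VIRTUAL_NODES = 100;

    // Place a partition on the first live node found clockwise on a hash ring.
    public static String place(String partition, Collection<String> liveNodes) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        for (String node : liveNodes) {
            for (int v = 0; v < VIRTUAL_NODES; v++) {
                ring.put((node + "@" + v).hashCode(), node);
            }
        }
        Map.Entry<Integer, String> slot = ring.ceilingEntry(partition.hashCode());
        return slot != null ? slot.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("n1", "n2", "n3", "n4");
        List<String> live = Arrays.asList("n1", "n2", "n4"); // n3 is down
        int moved = 0;
        for (int i = 0; i < 200; i++) {
            String before = place("p" + i, all);
            String after = place("p" + i, live);
            // Only partitions that were on the downed node may move.
            if (!before.equals(after) && !before.equals("n3")) {
                throw new AssertionError("unexpected extra movement for p" + i);
            }
            if (!before.equals(after)) {
                moved++;
            }
        }
        System.out.println(moved + " partitions moved, all from the downed node");
    }
}
```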
+
+One potential issue with this intuitive algorithm is that it may not converge. In that case, CRUSH-ed falls back to CRUSH.  
+The pseudocode is listed below.
+
+**Pseudo Code** 
+
+    // Round 1: Calculate mapping using the base strategy.
+    // Note to use all nodes for minimizing the influence of live node changes.
+    origPartitionMap = getBaseRebalanceStrategy().computePartitionAssignment(allNodes, clusterData);
+    
+    // Transform current assignment to instance->partitions map, and get total partitions
+    nodeToPartitionMap = convertMap(origPartitionMap);
+
+    // Round 2: Rebalance mapping using card dealing algorithm.
+    Topology allNodeTopo = new Topology(allNodes, clusterData);
+    cardDealer.computeMapping(allNodeTopo, nodeToPartitionMap);
+
+    // Since states are assigned according to preference list order, shuffle preference list for even states distribution.
+    shufflePreferenceList(nodeToPartitionMap);
+
+    // Round 3: Re-map the partitions on non-live nodes using consistent hashing to reduce movement.
+    // Consistent hashing ensures minimum movements when nodes are disabled unexpectedly.
+    if (!liveNodes.containsAll(allNodes)) {
+      Topology liveNodeTopo = new Topology(liveNodes, clusterData);
+      hashPlacement.computeMapping(liveNodeTopo, nodeToPartitionMap);
+    }
+
+    if (!nodeToPartitionMap.isEmpty()) {
+      // Rounds 2 and 3 completed successfully
+      return convertMap(nodeToPartitionMap);
+    } else {
+      return getBaseRebalanceStrategy().computePartitionAssignment(liveNodes, clusterData);
+    }
+
+The re-assignment logic is generic and can be applied to any algorithm, so we created an abstract class for the common logic and then strategy classes for Helix users. The Java class diagram looks like the following.
+
+![Example](images/design/crushed/214844307.png)
+
+### Cap on participant load difference when using CRUSH-ed
+
+For each resource, CRUSH-ed does its best to ensure the partition count difference is at most ONE (as long as such an assignment is possible given the fault zones). The worst case happens when all these differences land on one node: if N resources each contribute their additional partition to the same node, that node will have N additional partitions in total. Theoretically, the maximum difference between the most loaded node and the others is **the number of resources** in a cluster.
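
This worst case can be illustrated with concrete numbers (the node, resource, and partition counts below are hypothetical):

```java
public class WorstCaseCapSketch {
    public static void main(String[] args) {
        int nodes = 5;
        int resources = 4;
        int partitionsPerResource = 11; // 11 % 5 != 0, so each resource has one extra partition

        // Per resource, a balanced assignment gives floor(11/5) = 2 partitions
        // to four nodes and ceil(11/5) = 3 to one node: a per-resource difference of one.
        int floor = partitionsPerResource / nodes; // 2
        int ceil = floor + 1;                      // 3

        // Worst case: every resource puts its extra partition on the same node.
        int heaviestNode = resources * ceil;       // 12
        int lightestNode = resources * floor;      // 8

        // The cluster-wide difference equals the number of resources.
        if (heaviestNode - lightestNode != resources) {
            throw new AssertionError();
        }
        System.out.println("max difference = " + (heaviestNode - lightestNode)); // 4
    }
}
```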
+
+### Not a global uniform distribution solution that considers node capacity
+
+A global distribution algorithm would need to take the partition distribution of the other resources as part of its input. Since this ties the distributions of all resources together, the result would be very unstable: any change in any resource would change it. As a result, the rebalance process could lead to a large number of partition movements.  
+Therefore, the solution in this document cannot be used to calculate a globally uniform distribution, nor for node-capacity-aware throttling.  
+This requirement will be investigated in another initiative. Please refer to the future works.
+
+Experiment
+==========
+
+We tested CRUSH-ed by simulating rebalancing using real production cluster topology data, covering multiple scenarios:
+
+*   Distribution based on cluster topology.
+*   Disabling hosts to simulate hosts down.
+*   Adding hosts to simulate expansion.
+*   Rolling upgrade.
+
+All results show that CRUSH-ed generates a more uniform global distribution compared with CRUSH.  
+Moreover, partition movements are minimized in most scenarios, except for the host expansion case, which can be controlled by configuring state transition throttling.
+
+Partition Distribution
+----------------------
+
+The following charts show the worst cases (min load vs. max load) and STDEVs of the partition/master distributions of several Espresso PROD clusters.  
+Measured by STDEV, CRUSH-ed reduces partition distribution unevenness by 87% on average compared with CRUSH. For master replica distribution, the evenness improvement is 68% on average and 81% at maximum.
+
+![Example](images/design/crushed/214844302.png)![Example](images/design/crushed/214844301.png)
+
+![Example](images/design/crushed/214844298.png)![Example](images/design/crushed/214844297.png)
+
+In addition, even for clusters that already have a good distribution with CRUSH, real resource usage might be uneven because of partition size variance.  
+For example, Venice-0 has more than 50k replicas, so even with CRUSH the partition distribution is already good: the STDEV is 32.  
+However, because the partitions differ considerably in storage usage, the nodes' storage usage STDEV is 222, and the max disk usage is about 2956GB.
+
+With CRUSH-ed, the partition distribution improves a little (STDEV is 6).  
+Moreover, since all the resources have a uniform distribution, the total storage usage is much more even than with CRUSH: the STDEV is 30, and the max disk usage is about 2644GB.
+
+Given that host resources are reserved according to max usage, CRUSH-ed can save (2956 - 2644) / 2956 = 10.5% of the cost.
+
+![Example](images/design/crushed/214844300.png)![Example](images/design/crushed/214844299.png)
+
+Disabling Nodes
+---------------
+
+When nodes are offline or disabled, CRUSH-ed re-assigns their partitions to other live nodes. The algorithm ensures that only the necessary partitions move.  
+We simulated disabling a certain number of nodes and checked the resulting partition movements and master changes. We also used the expected movement (the partition/master count on the disabled nodes) as a baseline to measure whether any extra movements happen.
+
+The results show that movement is highly correlated with the portion of disabled nodes, and extra movements are minor (0 movements in most cases).
+
+Note that **Rate** in this document means **the changed count / the total partition or master count**.
+
+![Example](images/design/crushed/214844296.png)![Example](images/design/crushed/214844295.png)
+
+Rolling upgrade
+---------------
+
+Rolling upgrade is different from disabling nodes. Since nodes are reset one by one, in this test we assume the difference is at most 2 nodes (for example, upgrading Node A and then upgrading Node B).  
+In this case, movements are still minimized.
+
+Note that in real PROD clusters, since admins can configure a delayed rebalance scheduler, there may be no movement at all.  
+The following chart shows the maximum recorded partition movements or master changes during the whole rolling upgrade process. Even in the worst case, extra partition movements and master changes are close to 0%.  
+![Example](images/design/crushed/214844289.png)
+
+Adding Nodes
+------------
+
+Adding nodes (cluster expansion) changes the topology. Since CRUSH-ed leverages card dealing for evenness, when the topology changes the movements are much greater than with CRUSH.  
+As we evaluated, the moved partitions are mostly the uneven ones.
+
+According to our experiment using all Espresso clusters, the extra partition movement rate can be as high as 23%, and master changes as high as 58%.  
+These percentages are again relative to the total partition count or total master count.
+
+Note that the extra change rate is not correlated with the number of added nodes, so we recommend finishing expansion in a single operation so that the partitions are shuffled only once.
+
+![Example](images/design/crushed/214844294.png)
+
+Algorithm Performance
+---------------------
+
+We compared CRUSH-ed with the CRUSH algorithms using different instance counts. The tests were executed multiple times, and we recorded the median numbers.  
+The results are shown as follows.
+
+Note that CRUSH-ed does not cost much additional calculation time for regular rebalancing: at most 30% more runtime compared with CRUSH, and less than MultiRoundCRUSH.
+
+![Example](images/design/crushed/214844283.png)![Example](images/design/crushed/214844284.png)
+
+However, when nodes are down, CRUSH-ed needs to run an additional consistent-hashing-based re-distribution, so the calculation time is much longer: more than 3 times that of CRUSH.
+
+![Example](images/design/crushed/214844282.png)
+
+With some **performance improvements**, such as caching to avoid duplicate calculation, we greatly reduced CRUSH-ed's running time. According to our experiments, it is now close to that of MultiRoundCRUSH.
+
+![Example](images/design/crushed/214844281.png)
+
+Additional Tests for Corner Cases
+---------------------------------
+
+These tests prove that the algorithm still works in extreme cases.
+
+**ESPRESSO_IDENTITY**
+
+This cluster has only one resource. We collected the distribution details for all 4 fabrics.
+![Example](images/design/crushed/214844292.png)![Example](images/design/crushed/214844293.png)![Example](images/design/crushed/214844291.png)![Example](images/design/crushed/214844290.png)
+
+#### 100 nodes, 100 resources and 101 partitions for each resource
+
+The result is good: the STDEV is 0.11, the most loaded node has 104 partitions, and the least loaded node has 100 partitions.
+
+#### 100 nodes, 1 resource with 10100 partitions
+
+The result is strictly even: the STDEV of the distribution is 0.
+
+### Test without Fault Zone
+
+#### Distribution with CRUSH-ed
+
+|Total Replica|Min Replica|Max Replica|Partition STDDEV|Min Master|Max Master|Master STDDEV|
+|---|---|---|---|---|---|---|
+|30720|520|523|0.09886598|170|177|0.21981398|
+
+#### Partition Change on Node down (disabled)
+
+|Number Of Node Change|Total Partition Movements|Moved Partitions / Total Partitions (%)|Extra Movements (not on the changed nodes)|Total Master Changes|Moved Masters / Total Masters (%)|Extra Changes (not on the changed nodes)|
+|---|---|---|---|---|---|---|
+|-7|3642|11.8554688|0|1212|11.8359375|0.09765625|
+
+### Test with 5 Fault Zones (11 participants in one zone and 12 participants in the rest)
+
+#### Distribution with CRUSH-ed
+
+|Total Replica|Min Replica|Max Replica|Partition STDDEV|Min Master|Max Master|Master STDDEV|
+|---|---|---|---|---|---|---|
+|30720|518|524|0.12537857|169|177|0.20728582|
+
+#### Partition Change on Random Node down (disabled)
+
+|Number Of Node Change|Total Partition Movements|Moved Partitions / Total Partitions (%)|Extra Movements (not on the changed nodes)|Total Master Changes|Moved Masters / Total Masters (%)|Extra Changes (not on the changed nodes)|
+|---|---|---|---|---|---|---|
+|-7|3646|11.8684896|0|1206|11.7773438|0|
+
+#### Partition Change on a Whole Fault Zone down (disabled)
+
+|Number Of Node Change|Total Partition Movements|Moved Partitions / Total Partitions (%)|Extra Movements (not on the changed nodes)|Total Master Changes|Moved Masters / Total Masters (%)|Extra Changes (not on the changed nodes)|
+|---|---|---|---|---|---|---|
+|-12|6248|20.3385417|0|2079|20.3027344|0|
+
+#### Partition Change on Half of the Disabled Fault Zone Coming Back
+
+|Number Of Node Change|Total Partition Movements|Moved Partitions / Total Partitions (%)|Extra Movements (not on the changed nodes)|Total Master Changes|Moved Masters / Total Masters (%)|Extra Changes (not on the changed nodes)|
+|---|---|---|---|---|---|---|
+|+6|3767|12.2623698|0|1043|10.1855469|0|
+
+### Additional Tests to Validate Strategy Stability
+
+*   Disable and re-enable participants: the distribution is recovered.
+*   Keep bouncing the cluster by resetting random participants: the distribution remains good.
+
+Conclusion
+==========
+
+CRUSH-ed achieves a more uniform distribution compared with CRUSH, at the cost of a longer running time and more partition movement when the cluster changes.
+
+Moreover, according to our experiments, MultiRoundCRUSH-FEAP can achieve the most uniform distribution, but the gain does not justify its additional cost, so it is not recommended.
+
+User Guide
+==========
+
+1.  Ensure the resource is using the FULL_AUTO mode.
+2.  Set rebalance strategy to be "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy".
+3.  Expect more partition movement on topology changes when using the new strategy.
+
+**IdealState SimpleFields Example** 
+
+    HELIX_ENABLED : "true"
+    IDEAL_STATE_MODE : "AUTO_REBALANCE"
+    REBALANCE_MODE : "FULL_AUTO"
+    REBALANCE_STRATEGY : "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy"
+    MIN_ACTIVE_REPLICAS : "0"
+    NUM_PARTITIONS : "64"
+    REBALANCER_CLASS_NAME : "org.apache.helix.controller.rebalancer.DelayedAutoRebalancer"
+    REPLICAS : "1"
+    STATE_MODEL_DEF_REF : "LeaderStandby"
+
+Future Works
+============
+
+Instance Level Capacity Limitation
+----------------------------------
+
+Currently, all resources are assigned separately.  
+The pro of this design is that resource changes won't cause existing partitions to be re-assigned.  
+The cons are:
+
+1.  It's hard to ensure a strictly uniform global distribution.
+2.  Instance-level capacity control is not possible, given that the algorithm doesn't have a global view of the partition assignment.
+
+Rebalance Algorithm Takes Partition Weight into Consideration
+-------------------------------------------------------------
+
+This algorithm still considers all partitions to be equally weighted. In reality, different partitions may have different resource requirements.  
+Application admins need to configure partition weights, and Helix should assign partitions accordingly.
+
+Note that this feature only makes sense when applied to a global assignment algorithm, since all partitions within a single resource are weighted the same.


[2/2] helix git commit: made edits to CrushED design doc

Posted by lx...@apache.org.
made edits to CrushED design doc


Project: http://git-wip-us.apache.org/repos/asf/helix/repo
Commit: http://git-wip-us.apache.org/repos/asf/helix/commit/ad510984
Tree: http://git-wip-us.apache.org/repos/asf/helix/tree/ad510984
Diff: http://git-wip-us.apache.org/repos/asf/helix/diff/ad510984

Branch: refs/heads/master
Commit: ad51098489e22c59da1b7f39292968a0f1bff718
Parents: 1f69e5a
Author: Eric Kim <er...@linkedin.com>
Authored: Fri Jun 1 14:15:24 2018 -0700
Committer: Eric Kim <er...@linkedin.com>
Committed: Fri Jun 1 14:15:24 2018 -0700

----------------------------------------------------------------------
 .../0.8.1/src/site/markdown/design_crushed.md   | 26 +++-----------------
 1 file changed, 4 insertions(+), 22 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/helix/blob/ad510984/website/0.8.1/src/site/markdown/design_crushed.md
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/markdown/design_crushed.md b/website/0.8.1/src/site/markdown/design_crushed.md
index 7fd1773..db755f6 100644
--- a/website/0.8.1/src/site/markdown/design_crushed.md
+++ b/website/0.8.1/src/site/markdown/design_crushed.md
@@ -32,17 +32,17 @@ Since the problem is NP-hard. We are not expecting the best assignment. A greedy
 After we tried different designs, we found it's hard to achieve both goals (even distribution and fewer movements) using a single strategy. So we decided to apply a hybrid algorithm that finishes the work step by step.
 
 **Step 1, run CRUSH to get a base assignment.**  
-The base assignment usually contains a certain number of uneven partitions, so we need the following steps to re-distribute them.
+The base assignment usually contains a certain number of uneven partitions(i.e. extra partitions above perfect distribution), so we need the following steps to re-distribute them.
 
 **Step 2, run a card dealing algorithm on the uneven parts.**  
-And assign them to idle nodes. This algorithm is conceptually simple. The result ensures that all partitions are assigned to instances with minimum difference. Note that when fault zone joins the game, our greedy algorithm may not be able to calculate possible results because the candidate assignment may have fault zone conflict. So we add the buffer to tolerate small uneven assignment.
+Assign extra partitions to under-loaded nodes, using the card dealing strategy. This algorithm is conceptually simple. The result ensures that all partitions are assigned to instances with minimum difference. When guaranteeing a fault-zone-safe assignment, our greedy algorithm may not be able to calculate possible results because of fault-zone conflicts. 
 
 Example of assignments after step 2,
 
 ![Example](images/design/crushed/example-cluster-partition-dist.png)
 
 **Step 3, Shuffle partitions' preference lists.**  
-Since replica states are assigned according to node order in these lists, if the lists are randomly ordered, State assignment (i.e. Master, Slave, Online, Offline) will also be random, so this may result in uneven states distribution. To resolve this issue, CrushED assigns scores to nodes as it computes pref list, to give all nodes equal chances in appearing at the top of the pref list. This operation results in a much more even state distribution.
+State assignments (i.e. Master, Slave, Online, Offline, etc.) are made according to the preference list's node order. When using randomly ordered lists, the state assignment is also random, and it may result in an uneven state distribution. To resolve this issue, CrushED assigns scores to nodes as it computes the pref list, to give all nodes equal chances of appearing at the top of the pref list. This operation results in a much more even state distribution.
 
 Example of master distribution before step 3,
 
@@ -56,8 +56,7 @@ Example of master distribution after step 3,
 Consistent hashing ensures minimal partition movement.  
 Note that the first 3 steps use the full node list, regardless of disabled or offline nodes, so the assignment is stable even though the algorithm contains random factors such as hashCode. Step 4 then ensures all the disabled nodes are handled correctly without causing huge partition movements.
 
-One potential issue of using intuitive algorithm is not converging. In this case, CrushED falls back to CRUSH.  
-Pseudocode is listed below.
+Pseudocode of the above algorithm is as follows:
 
 **Pseudo Code** 
 
@@ -172,20 +171,3 @@ CrushED achieves more uniform distribution compared with CRUSH at the cost of hi
     REPLICAS : "1"
     STATE\_MODEL\_DEF_REF : "LeaderStandby"
 
-## Future Works
-
-**Instance Level Capacity Limitation**
-
-Currently, all resources are assigned separately.  
-The pros of this design are that resources change won't cause existing partitions to be re-assigned.  
-The cons are:
-
-1.  It's hard to ensure strict global uniform distribution.
-2.  Instance level capacity control is not possible given the algorithm doesn't have a global view of partition assignment.
-
-**Rebalance Algorithm Takes Partition Weight into Consideration**
-
-This algorithm still considers all partitions to be equally weighted. But in reality, different partitions may have different resource requirements.  
-Application admins need to configure partition weight and Helix should assignment them accordingly.
-
-Note this feature only makes sense when it is applied to a global assignment algorithm since each partition in the same resource are weighted the same.