You are viewing a plain text version of this content. The canonical link for it is here.
Posted to s4-commits@incubator.apache.org by ki...@apache.org on 2012/12/20 19:06:49 UTC

[7/19] git commit: Update README.md

Update README.md

Project: http://git-wip-us.apache.org/repos/asf/incubator-s4/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-s4/commit/f9e09986
Tree: http://git-wip-us.apache.org/repos/asf/incubator-s4/tree/f9e09986
Diff: http://git-wip-us.apache.org/repos/asf/incubator-s4/diff/f9e09986

Branch: refs/heads/S4-110
Commit: f9e09986462ca542133d18c839796fd90199fe07
Parents: 4d9581b
Author: Kishore Gopalakrishna <g....@gmail.com>
Authored: Thu Nov 29 20:46:10 2012 -0800
Committer: Kishore Gopalakrishna <g....@gmail.com>
Committed: Thu Nov 29 20:46:10 2012 -0800

----------------------------------------------------------------------
 README.md |   15 +++++++--------
 1 files changed, 7 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-s4/blob/f9e09986/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 58373f3..b5cedc7 100644
--- a/README.md
+++ b/README.md
@@ -8,15 +8,14 @@ Goal is to provide better partition management, fault tolerance and automatic re
 
 Limitation in S4
    * Tasks are always partitioned based on the number of nodes
-   * If the stream is already partitioned outside of s4 or in a pub-sub system but number of nodes in s4 cluster is different, then the stream needs to be re-partitioned which results in additional hop. In cases where multiple streams that are already partitioned outside of S4 but needs to be joined in S4, it requires re-hashing both the streams.
-   * When the system needs to scale, adding new nodes mean the number of partitions change. This results in lot of data shuffling and possibly losing all the state that is stored.
-   * Also the fault tolerance is achieved by having stand alone nodes that remain idle and becomes active when a node fails. This results in inefficient use of hardware resources.
+   * If the stream is already partitioned outside of s4 or with a different partitioning factor, then the stream needs to be re-partitioned which results in additional hop. In cases where multiple streams that are already partitioned outside of S4 but needs to be joined in S4, it requires re-hashing both the streams.
+   * Adding new nodes to S4 cluster cause the number of partitions to change. This results in lot of data shuffling. For example if there are 4 nodes and you add another node then 80% of the data is moved from its current location to a new location where as ideally only 20% of the data should move.
+   * Fault tolerance is achieved by having standby nodes which become active when a node fails. This results in inefficient use of hardware resources.
    
-Integrating with Apache Helix, 
-   * Allows one to partition the task processing differently for each stream. 
-   * Allows adding new nodes without having to change the number of partitions. 
-   * Fault tolerance can be achieved at a partition level which means all nodes can be active and when a node fails its load can be equally balanced among the remaining active nodes.
-
+Advantages of integrating with Apache Helix, 
+   * Allows one to partition the task processing differently for each stream, which provides better load balancing. 
+   * When new nodes are added, partitions can be migrated to new nodes without changing the number of partitions. This minimizes the data movement.
+   * All Nodes in the cluster are active and when a node fails, its load gets redistributed among the remaining active nodes
 
 This is still in prototype mode.