Posted to commits@samza.apache.org by cr...@apache.org on 2013/08/24 01:07:09 UTC

git commit: updating arch page to answer stone's question about key to topic/partition mapping.

Updated Branches:
  refs/heads/master f93f7eabe -> ff6073d9b


updating arch page to answer stone's question about key to topic/partition mapping.


Project: http://git-wip-us.apache.org/repos/asf/incubator-samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-samza/commit/ff6073d9
Tree: http://git-wip-us.apache.org/repos/asf/incubator-samza/tree/ff6073d9
Diff: http://git-wip-us.apache.org/repos/asf/incubator-samza/diff/ff6073d9

Branch: refs/heads/master
Commit: ff6073d9b447ef5d86442b457fc20ae13335da13
Parents: f93f7ea
Author: Chris Riccomini <cr...@criccomi-mn.linkedin.biz>
Authored: Fri Aug 23 16:07:02 2013 -0700
Committer: Chris Riccomini <cr...@criccomi-mn.linkedin.biz>
Committed: Fri Aug 23 16:07:02 2013 -0700

----------------------------------------------------------------------
 docs/learn/documentation/0.7.0/introduction/architecture.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/ff6073d9/docs/learn/documentation/0.7.0/introduction/architecture.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/introduction/architecture.md b/docs/learn/documentation/0.7.0/introduction/architecture.md
index 74470d1..b733f69 100644
--- a/docs/learn/documentation/0.7.0/introduction/architecture.md
+++ b/docs/learn/documentation/0.7.0/introduction/architecture.md
@@ -83,7 +83,7 @@ Let's take a look at a real example. Suppose that we wanted to count page views
 
 ![diagram-large](/img/0.7.0/learn/documentation/introduction/group-by-example.png)
 
-The input topic is partitioned using Kafka. Each Samza process reads messages from one or more of the input topic's partitions, and emits them back out to a different Kafka topic keyed by the message's member ID attribute. The Kafka brokers receive these messages, and buffer them on disk until the second job (the counting job on the bottom of the diagram) reads the messages, and increments its counters.
+The input topic is partitioned using Kafka. Each Samza process reads messages from one or more of the input topic's partitions, and emits them back out to a different Kafka topic. Each output message is keyed by the message's member ID attribute, and this key is mapped to one of the topic's partitions (usually by hashing the key, and modding by the number of partitions in the topic). The Kafka brokers receive these messages, and buffer them on disk until the second job (the counting job on the bottom of the diagram) reads the messages, and increments its counters.
 
 There are some neat things to consider about this example. First, we're leveraging the fact that Kafka topics are inherently partitioned. This lets us run one or more Samza processes, and assign them each some partitions to read from. Second, since we're guaranteed that, for a given key, all messages will be on the same partition, we can actually split up the aggregation (counting). For example, if the first job's output had four partitions, we could assign two partitions to the first count process, and the other two partitions to the second count process. We'd be guaranteed that for any give member ID, all of their messages will be consumed by either the first process or the second, but not both. This means we'll get accurate counts, even when partitioning. Third, the fact that we're using Kafka, which buffers messages on its brokers, also means that we don't have to worry as much about failures. If a process or machine fails, we can use YARN to start the process on another machine. When the process starts up again, it can get its last offset, and resume reading messages where it left off.
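
The key-to-partition mapping described in the updated paragraph can be sketched roughly as follows. This is a minimal illustration, not Samza or Kafka code: the function names (`partition_for`, `owner_of`, `count_page_views`) are hypothetical, `zlib.crc32` is only a stand-in for whatever hash the real partitioner uses, and assigning partitions to count processes by modulus is an assumption made to keep the example short.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition: hash the key, mod by partition count."""
    # crc32 is a stable stand-in hash (assumption); the real partitioner may differ.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def owner_of(partition: int, num_processes: int) -> int:
    """Assign each partition to exactly one count process (hypothetical scheme)."""
    return partition % num_processes


def count_page_views(member_ids, num_partitions=4, num_processes=2):
    """Simulate the two-stage job: repartition by member ID, then count."""
    # One independent counter per count process; no state is shared between them.
    counts = [Counter() for _ in range(num_processes)]
    for member_id in member_ids:
        partition = partition_for(member_id, num_partitions)
        counts[owner_of(partition, num_processes)][member_id] += 1
    return counts


if __name__ == "__main__":
    views = ["alice", "bob", "alice", "carol", "bob", "alice"]
    for process_id, counter in enumerate(count_page_views(views)):
        print(process_id, dict(counter))
```

Because a given member ID always hashes to the same partition, and each partition belongs to exactly one count process, every member's messages are seen by a single process. That is why the per-process counters can be kept independently and still add up to accurate totals.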