Posted to commits@samza.apache.org by Apache Wiki <wi...@apache.org> on 2013/10/28 19:16:58 UTC

[Samza Wiki] Update of "Pluggable MessageChooser" by ChrisRiccomini

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Samza Wiki" for change notification.

The "Pluggable MessageChooser" page has been changed by ChrisRiccomini:
https://wiki.apache.org/samza/Pluggable%20MessageChooser?action=diff&rev1=9&rev2=10

   * Choose the message with the smallest offset.
   * Choose the message from the stream with the most unprocessed messages.
   * Choose the message from the stream that is the farthest % behind head.
+  * Choose the message that keeps streams most aligned with their alignment at start time.
  
  ==== Offsets ====
  
@@ -176, +177 @@

  
  ==== Messages behind head ====
  
- I need to think about this a bit more.
+ This approach would require changing the API to allow the !MessageChooser to know how far behind head a message is.
+ 
+ This strategy behaves poorly when we have one very high-throughput stream and one very low-throughput stream. In such a case, the low-throughput stream might be 100 messages behind, which might be equivalent to two days in wall-clock time. The high-throughput stream might be 1000 messages behind, which might be equivalent to only 10 seconds in wall-clock time. Using this strategy, the high-throughput stream's message would be picked, even though the low-throughput stream is much farther behind in time. This is counterintuitive, since we're trying to find a proxy for time alignment, and this approach does not align by time in this scenario.
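
A rough sketch of the mismatch, using the numbers from the example above. The `StreamLag` class and both chooser functions are made up for illustration; they are not part of the Samza API, and the real !MessageChooser would not have wall-clock lag available.

```python
from dataclasses import dataclass


@dataclass
class StreamLag:
    """Hypothetical per-stream lag, filled in with the example's numbers."""
    name: str
    messages_behind: int    # messages between the next message and head
    seconds_behind: float   # the same lag expressed in wall-clock time


def choose_most_messages_behind(streams):
    # "Messages behind head" strategy: pick the largest message count.
    return max(streams, key=lambda s: s.messages_behind)


def choose_most_time_behind(streams):
    # What time alignment would want: pick the largest wall-clock lag.
    return max(streams, key=lambda s: s.seconds_behind)


low = StreamLag("low-throughput", 100, 2 * 24 * 60 * 60)  # two days
high = StreamLag("high-throughput", 1000, 10)             # ten seconds

by_messages = choose_most_messages_behind([low, high])
by_time = choose_most_time_behind([low, high])

print(by_messages.name)  # high-throughput
print(by_time.name)      # low-throughput
```

The two choosers disagree, which is exactly the problem: message count stops tracking time once throughput differs between streams.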
  
  ==== Percent behind head ====
  
- I need to think about this a bit more.
+ This approach would require changing the API to allow the !MessageChooser to know (or derive) what percentage behind head a message is.
+ 
+ In cases where you have two input streams created at dramatically different points in time (for example, a year ago and a day ago), the percentage behind head is a misleading measurement. Being 50% behind on a stream that was created a year ago means you have half a year's worth of messages to process. Being 90% behind on a stream that was created a day ago means you have about 21.6 hours of messages to process. In this scenario, this strategy would pick messages from the stream created yesterday, even though it's actually much closer to "now" in wall-clock time. This, again, is counterintuitive, since our goal is to find a proxy for time alignment.
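
The arithmetic behind that example can be checked with a few lines. This is only a sketch: `backlog_hours` is a hypothetical helper, and it assumes messages arrived at a constant rate over each stream's lifetime, which real streams need not do.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def backlog_hours(stream_age_hours, percent_behind):
    # Assumption: a constant arrival rate, so a percentage of the messages
    # corresponds to the same percentage of the stream's age.
    return stream_age_hours * percent_behind / 100


year_old = backlog_hours(DAYS_PER_YEAR * HOURS_PER_DAY, 50)
day_old = backlog_hours(HOURS_PER_DAY, 90)

# The strategy compares the percentages (90 > 50) and picks the day-old
# stream, although its backlog covers far less wall-clock time.
print(f"year-old stream: {year_old:.1f} hours behind")
print(f"day-old stream:  {day_old:.1f} hours behind")
```

Under that assumption the year-old stream is thousands of hours behind while the day-old stream is less than a day behind, yet the percentage comparison favors the day-old stream.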
+ 
+ ==== Maintain starting alignment ====
+ 
+ This approach would require changing the API to allow the !MessageChooser to know how far behind head a message is.
+ 
+ It appears that we can't come up with a good general-purpose proxy for time alignment. In the absence of such a strategy, the next best thing seems to be to preserve the alignment that the streams had when the Samza job started.
+ 
+ For example, take the case where there are two input streams at job-start time: one that's 100 messages behind, and the other that's 1000 messages behind. The streams can be in one of three states in this example: both streams are behind their starting alignments (> 100 messages and > 1000 messages behind, respectively), one stream is behind and the other is ahead of its starting alignment (for example, > 100 messages behind and < 1000 messages behind), or both streams are ahead of their starting alignments. The strategy for the !MessageChooser then becomes:
+ 
+  1. All behind: pick a message from the stream that is farthest behind its original alignment (in terms of number of messages).
+  2. Partially behind: pick a message from the stream that is farthest behind its original alignment (in terms of number of messages).
+  3. All ahead: pick a message from the stream that is closest to its original alignment (in terms of number of messages).
+ 
+ Choosing messages when all streams are ahead of their original alignments (3) is an interesting case. Taking our original example, if the stream that started 100 messages behind is now 90 messages behind head, then it is 10 messages from its original alignment. If the stream that started 1000 messages behind is now 900 messages behind head, then it is 100 messages from its original alignment. The !MessageChooser would pick a message from the stream that is 10 messages ahead of its original alignment, because it is deemed to be "closest" to its original alignment.
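
If I'm reading the three rules right, they collapse into a single comparison once the distance from the original alignment is treated as a signed number: positive when a stream has fallen behind its starting alignment, negative when it has moved ahead. The sketch below is only an illustration of that idea. The function names and the dictionary inputs are hypothetical, not the Samza API, and ties (including a stream sitting exactly on its starting alignment) simply go to whichever stream is listed first.

```python
def alignment_drift(start_lag, current_lag):
    # Positive: the stream has fallen behind its starting alignment.
    # Negative: the stream has moved ahead of it.
    return current_lag - start_lag


def choose_stream(start_lags, current_lags):
    # Rules 1 and 2: the stream farthest behind has the largest positive drift.
    # Rule 3: when every drift is negative, the stream closest to its
    # starting alignment has the largest (least negative) drift.
    # All three cases therefore reduce to "pick the maximum drift".
    return max(
        start_lags,
        key=lambda name: alignment_drift(start_lags[name], current_lags[name]),
    )


start = {"small": 100, "large": 1000}  # messages behind head at job start

all_behind = choose_stream(start, {"small": 110, "large": 1300})
partially_behind = choose_stream(start, {"small": 90, "large": 1005})
all_ahead = choose_stream(start, {"small": 90, "large": 900})

print(all_behind)        # large
print(partially_behind)  # large
print(all_ahead)         # small
```

The last call reproduces the example above: the stream that is 10 messages ahead of its original alignment wins over the one that is 100 messages ahead.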
+ 
+ I'm going to declare this strategy out of scope, but open a JIRA to track it.
  
  === DefaultChooser ===