Posted to dev@samza.apache.org by Yan Fang <ya...@gmail.com> on 2014/07/09 10:30:02 UTC

Review Request 23358: SAMZA-225

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/
-----------------------------------------------------------

Review request for samza.


Repository: samza


Description
-------

Comparison of Spark Streaming and Samza


Diffs
-----

  docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
  docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
  docs/learn/documentation/0.7.0/index.html 149ff2b 

Diff: https://reviews.apache.org/r/23358/diff/


Testing
-------


Thanks,

Yan Fang


Re: Review Request 23358: SAMZA-225

Posted by Martin Kleppmann <mk...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47736
-----------------------------------------------------------


I started commenting on this RB, but then noticed that the RB doesn't reflect the latest patch. Could you update the RB, please? Here are just some comments on the first few paragraphs; I'll go over the rest when it's up to date.


docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment83927>

    Storm can run in two different modes: with its lower-level API of bolts, it processes messages as they are received, whereas its higher-level Trident API performs batching (somewhat similarly to Spark Streaming). As that subtlety isn't really relevant here, I'd suggest just removing the mention of Storm in this paragraph.
    
    "Discretized stream ... is a continuous sequence": I find this juxtaposition of "discrete" and "continuous" a bit jarring. It's confusing because you're using "continuous" in the sense of "neverending", but "continuous" is sometimes also used as the opposite of "discrete". "continuous" is just a bit ambiguous here.
    
    Perhaps say something like: "Spark Streaming groups the stream into batches of a fixed duration (such as 1 second). Each batch is represented as an [RDD](...) file. A neverending sequence of these RDDs is called a _discretized stream_ ([DStream](...))."
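
To make the suggested wording concrete, here is a minimal sketch in plain Python (not the Spark API) of grouping a stream into fixed-duration batches. The function name `discretize` and the sample events are made up for illustration; each returned pair stands in for one RDD and the whole list for a DStream. The sketch simply omits intervals that contain no events.

```python
from collections import defaultdict

def discretize(events, batch_seconds=1.0):
    """Group (timestamp, value) events into fixed-duration batches.

    Returns a list of (batch_start, [values]) pairs ordered by time.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Integer index of the batch interval this event falls into.
        index = int(timestamp // batch_seconds)
        batches[index].append(value)
    return [(index * batch_seconds, batches[index]) for index in sorted(batches)]

# Hypothetical input: four events spread over three one-second intervals.
events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.9, "d")]
dstream = discretize(events, batch_seconds=1.0)
```

With a one-second batch duration, the first two events land in the same batch and the other two each get their own.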



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment83929>

    "let me give a brief overview": we're not using the first person elsewhere in the docs; I'd prefer to keep the tone consistent.
    
    You could probably also break these long paragraphs up into shorter ones, to make them easier to read. Bullet points or numbered lists also help readability. "There are two main parts...:" suggests a good place for a numbered list, for example.



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment83930>

    s/core number/number of cores/


- Martin Kleppmann


On July 11, 2014, 8:37 a.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23358/
> -----------------------------------------------------------
> 
> (Updated July 11, 2014, 8:37 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Comparison of Spark Streaming and Samza
> 
> 
> Diffs
> -----
> 
>   docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
>   docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
>   docs/learn/documentation/0.7.0/index.html 149ff2b 
> 
> Diff: https://reviews.apache.org/r/23358/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>


Re: Review Request 23358: SAMZA-225

Posted by Martin Kleppmann <ma...@kleppmann.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review48533
-----------------------------------------------------------

Ship it!


Minor typos, otherwise I think this is good to go. Good work!


docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment85187>

    Typo: "Saprk".
    
    Missing period at the end of the paragraph.


- Martin Kleppmann


On July 23, 2014, 6:13 a.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23358/
> -----------------------------------------------------------
> 
> (Updated July 23, 2014, 6:13 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Comparison of Spark Streaming and Samza
> 
> 
> Diffs
> -----
> 
>   docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
>   docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
>   docs/learn/documentation/0.7.0/index.html 149ff2b 
> 
> Diff: https://reviews.apache.org/r/23358/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>


Re: Review Request 23358: SAMZA-225

Posted by Yan Fang <ya...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/
-----------------------------------------------------------

(Updated July 23, 2014, 6:13 a.m.)


Review request for samza.


Changes
-------

Updated based on Martin's comments.


Repository: samza


Description
-------

Comparison of Spark Streaming and Samza


Diffs (updated)
-----

  docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
  docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
  docs/learn/documentation/0.7.0/index.html 149ff2b 

Diff: https://reviews.apache.org/r/23358/diff/


Testing
-------


Thanks,

Yan Fang


Re: Review Request 23358: SAMZA-225

Posted by Yan Fang <ya...@gmail.com>.

> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 42
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line42>
> >
> >     Does this state DStream provide any key-value access or other query model? If it's just a stream of records, that would imply that every time a batch of input records is processed, the stream processor also needs to consume the entire state DStream. That's fine if the state is small, but with a large amount of state (multiple GB), it would probably get very inefficient. If this is true, it would further support our "Samza is good if you have lots of state" story.
> >     
> >     Also: you don't mention anything about stream joins in this comparison. I see Spark has a join operator -- do you know what it does? Does it just take one batch from each input stream and join within those batches? Or can you do joins across a longer window, or against a table?
> >     
> >     Since joins typically involve large amounts of state, they are worth highlighting as an area where Samza may be stronger.
> >     
> >     "Everytime updateStateByKey is applied, you will get a new state DStream": presumably you get a new DStream once per batch, not for every single message within a batch?
> 
> Yan Fang wrote:
>     AFAIK, no other methods. will update when I know. It's a little interesting in Spark Streaming. Seems it only updates the state of the keys when the keys appear in this time interval. (because updateStateByKey only is called every time interval). So maybe there is not concern of "consume the entire state DStream", instead, the concern is "how can I change the previous state and other key's state". asking this in the community now.
>     
>     "join" is a little tricky. You can join two DStreams in the same time interval, meaning that, you can join two batches received from the same time interval but can not join two DStreams that have different time intervals, such as a realtime batch and a window batch.
>     
>     Yes, once per batch, not for single message. will emphasize this.

Update:

The following statement is wrong: "Seems it only updates the state of the keys when the keys appear in this time interval. (because updateStateByKey only is called every time interval).".

You were right: for every batch, the stream processor needs to consume the entire state DStream, so Spark Streaming becomes inefficient when the state is very large. They are working on this (https://issues.apache.org/jira/browse/SPARK-2365). I will mention this in the updated version of the doc.
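
The cost pattern can be sketched in plain Python (a simulation, not Spark's implementation). The helper `update_state_by_key`, the `running_count` function and the 1000-key state are assumptions for illustration. The point is that every key in the state is visited on each batch, even when only one key received new records.

```python
def update_state_by_key(state, batch, update):
    """Apply update(new_values, old_value) to every key in state or batch.

    Keys with no new records in this batch are still visited, with an
    empty list of new values. Returns (new_state, keys_visited).
    """
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    visited = 0
    for key in set(state) | set(grouped):
        visited += 1
        result = update(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from the state
            new_state[key] = result
    return new_state, visited

def running_count(new_values, old_count):
    # Add the number of new records to the previous count (0 if absent).
    return (old_count or 0) + len(new_values)

# Hypothetical state with 1000 keys; the batch touches only one of them.
state = {f"user{i}": 1 for i in range(1000)}
state, visited = update_state_by_key(state, [("user7", "click")], running_count)
```

Here a single incoming record still causes 1000 keys to be visited, which is why large state is the concern.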


- Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47895
-----------------------------------------------------------


On July 15, 2014, 6:15 p.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23358/
> -----------------------------------------------------------
> 
> (Updated July 15, 2014, 6:15 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Comparison of Spark Streaming and Samza
> 
> 
> Diffs
> -----
> 
>   docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
>   docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
>   docs/learn/documentation/0.7.0/index.html 149ff2b 
> 
> Diff: https://reviews.apache.org/r/23358/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>


Re: Review Request 23358: SAMZA-225

Posted by Yan Fang <ya...@gmail.com>.

> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 36
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line36>
> >
> >     This paragraph seems contradictory -- does Spark guarantee ordering or not? And what do you mean with "is not emphasized in the document"?
> >     
> >     My understanding is that Spark's transformation operators must be side-effect-free, so the order in which batches are processed is irrelevant. When one batch depends on the output of a previous batch (e.g. a window-based operation), Spark Streaming guarantees that the correct previous batch is used as input to the subsequent batch (which is effectively ordering, even if some of the execution may actually happen in parallel).
> >     
> >     I'm not sure about ordering of output operations (which may have side-effects).
> >     
> >     Another thing -- I believe Spark Streaming requires transformation operators to be deterministic. Is that true? If so, it would be worth mentioning, because 
> >     that may make it unsuitable for nondeterministic processing, e.g. a randomized machine learning algorithm. Samza has no such requirement.
> >     
> >     "Spark Streaming supports at-least once messaging semantics": you say below that Spark Streaming may lose messages if the receiver task fails. If this is the case, the guarantee is neither at-least-once nor at-most-once, but more like zero-or-more-times.

By "is not emphasized in the document", I mean that I could not find any relevant documentation. From my tests, message order within one DStream seems to be guaranteed, but if you combine several DStreams during processing, no order is guaranteed.

You are right: transformation operations are side-effect-free, and output operations (should) have the side effects. All transformation operations run only after an output operation is called, because evaluation is lazy.
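
As a rough illustration of lazy evaluation, here is a small plain-Python sketch (not Spark code). The class `LazyBatch` and its `map`/`collect` methods are hypothetical stand-ins for a transformation and an output operation: the transformation is only recorded, and nothing runs until the output operation is called.

```python
class LazyBatch:
    """A batch whose transformations are recorded, not executed."""

    def __init__(self, records, steps=()):
        self._records = records
        self._steps = steps

    def map(self, func):
        # Transformation: only remember the function, do not run it.
        return LazyBatch(self._records, self._steps + (func,))

    def collect(self):
        # Output operation: run every recorded step now.
        out = []
        for record in self._records:
            for step in self._steps:
                record = step(record)
            out.append(record)
        return out

calls = []

def traced_double(x):
    calls.append(x)  # record that the function was actually invoked
    return x * 2

batch = LazyBatch([1, 2, 3]).map(traced_double)
calls_before_output = list(calls)  # still empty: nothing has executed yet
result = batch.collect()
```

`calls_before_output` stays empty, while `calls` is filled only once `collect()` runs.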

So I am being a little conservative about message ordering in Spark Streaming, in case I write something wrong.

Yes to "transformation operators to be deterministic", because you only apply the operations to a deterministic stream. I will mention that.

They do lose data and are working on it: https://issues.apache.org/jira/browse/SPARK-1730, https://issues.apache.org/jira/browse/SPARK-1647. The Kafka case is a little odd. Because of the consumer offset, it does not lose data, but it processes too many messages in the first interval (say, 2s) when you bring the receiver back up after the failure. Maybe I should mention that as well?


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 44
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line44>
> >
> >     Typo: "intermedia"
> >     
> >     Is this periodic writing of state to HDFS basically a form of checkpointing? If so, it might be worth calling it such.
> >     
> >     The state management page of the docs also calls out that checkpointing the entire task state is inefficient if the state is large. Could do a cross-reference.

Yes. To quote from the mailing list: "After every checkpointing interval, the latest state RDD is stored to HDFS in its entirety. Along with that, the series of DStream transformations that was setup with the streaming context is also stored into HDFS (the whole DAG of DStream objects is serialized and saved)."


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 42
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line42>
> >
> >     Does this state DStream provide any key-value access or other query model? If it's just a stream of records, that would imply that every time a batch of input records is processed, the stream processor also needs to consume the entire state DStream. That's fine if the state is small, but with a large amount of state (multiple GB), it would probably get very inefficient. If this is true, it would further support our "Samza is good if you have lots of state" story.
> >     
> >     Also: you don't mention anything about stream joins in this comparison. I see Spark has a join operator -- do you know what it does? Does it just take one batch from each input stream and join within those batches? Or can you do joins across a longer window, or against a table?
> >     
> >     Since joins typically involve large amounts of state, they are worth highlighting as an area where Samza may be stronger.
> >     
> >     "Everytime updateStateByKey is applied, you will get a new state DStream": presumably you get a new DStream once per batch, not for every single message within a batch?

AFAIK, there are no other methods; I will update when I know more. Spark Streaming is a little interesting here: it seems that it only updates the state of a key when that key appears in the current time interval (because updateStateByKey is only called once per time interval). So maybe the concern is not "consume the entire state DStream" but rather "how can I change the previous state and other keys' state". I am asking the community about this now.

"join" is a little tricky. You can join two DStreams over the same time interval, that is, you can join two batches received in the same time interval, but you cannot join two DStreams that have different time intervals, such as a real-time batch and a window batch.
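
A plain-Python sketch of the per-interval behaviour (a simulation under my own assumptions, not Spark's join operator): `join_per_interval` and the `clicks`/`views` data are hypothetical, and only records that arrive in the same interval are matched.

```python
def join_per_interval(left, right):
    """Inner-join two batch streams interval by interval.

    `left` and `right` map an interval index to a list of (key, value)
    pairs. Records are matched only within the same interval.
    """
    joined = {}
    for interval in sorted(set(left) & set(right)):
        right_by_key = {}
        for key, value in right[interval]:
            right_by_key.setdefault(key, []).append(value)
        joined[interval] = [
            (key, (lval, rval))
            for key, lval in left[interval]
            for rval in right_by_key.get(key, [])
        ]
    return joined

# "u1" appears in both streams during interval 0; "u2" appears in
# interval 1 on one side and interval 2 on the other, so it never matches.
clicks = {0: [("u1", "click-a")], 1: [("u2", "click-b")]}
views = {0: [("u1", "view-a")], 2: [("u2", "view-b")]}
result = join_per_interval(clicks, views)
```

The unmatched "u2" records show why a join across a longer window, or against a table, needs state that outlives a single batch.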

Yes, once per batch, not per message. I will emphasize this.


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 50
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line50>
> >
> >     "send them to executors" -> "sending them to executors"
> >     
> >     "the parallelism is simply accomplished by normal RDD operations, such as map, reduceByKey, reduceByWindow" -- this is unclear. How are these operations parallelized? I assume each RDD batch is processed sequentially by a single processing task, but different batches can be processed on different machines. Is that right?
> >     
> >     If your input stream is partitioned (i.e. there are multiple receivers), does each batch include only messages from a single partition, or are all the partitions combined into a single batch? Can you repartition a stream, e.g. when grouping on a field, to ensure that all messages with the same value in that field get grouped together? (akin to the shuffle phase in MapReduce)
> >     
> >     How does partitioning play into the ordering guarantees? Even if each partition is ordered, there typically isn't a deterministic ordering of messages across partitions; how does this interact with Spark's determinism requirement for operators?

"How are these operations parallelized? I assume each RDD batch is processed sequentially by a single processing task, but different batches can be processed on different machines. Is that right?"
  Not quite sure. Will update when I know.

"If your input stream is partitioned (i.e. there are multiple receivers), does each batch include only messages from a single partition, or are all the partitions combined into a single batch? Can you repartition a stream, e.g. when grouping on a field, to ensure that all messages with the same value in that field get grouped together? (akin to the shuffle phase in MapReduce)"

Actually, when you partition the stream, every partition is a DStream, and whatever you do is based on that DStream. So if you want a "whole" stream containing all the partitions, you use "union" to put them together into one DStream. "Group by" is then more or less provided by "reduceByKey". Spark in general (outside Spark Streaming) has groupByKey, but it is not supported in Spark Streaming.

Yes, you are right: there is no deterministic ordering of messages across partitions. I am not quite sure about "Spark's determinism requirement for operators". What does this mean?
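
Roughly, the union-then-reduce pattern looks like this in plain Python (a simulation, not the Spark Streaming API). The helpers `union` and `reduce_by_key` and the page-count data are assumptions for illustration: one batch per partition is merged, then values are combined per key.

```python
from functools import reduce

def union(*partition_batches):
    """Concatenate the batches of several partitions into one batch."""
    merged = []
    for batch in partition_batches:
        merged.extend(batch)
    return merged

def reduce_by_key(batch, func):
    """Combine all values that share a key using the binary function func."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# Hypothetical batches from two partitions of the same input stream.
partition0 = [("page1", 1), ("page2", 1)]
partition1 = [("page1", 1)]
counts = reduce_by_key(union(partition0, partition1), lambda a, b: a + b)
```

After the union, records for "page1" from both partitions end up under the same key, which is the grouping effect asked about.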


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 86
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line86>
> >
> >     Is this really true? I get the impression that with Spark Streaming you build an entire processing graph with a DSL API, and deploy that entire graph as one unit. The communication between the nodes in that graph (in the form of DStreams) is provided by the framework. In that way it looks similar to Storm.
> >     
> >     Samza is totally different -- each job is just a message-at-a-time processor, and there is no framework support for topologies. Output of a processing task always needs to go back to a message broker (e.g. Kafka).
> >     
> >     A positive consequence of Samza's design is that a job's output can be consumed by multiple unrelated jobs, potentially run by different teams, and those jobs are isolated from each other through Kafka's buffering. That is not the case with Storm's (and Spark Streaming's?) framework-internal streams.
> >     
> >     Although a Storm/Spark job could in principle write its output to a message broker, the framework doesn't really make this easy. It seems that Storm/Spark aren't intended to be used in a way where one topology's output is another topology's input. By contrast, in Samza, that mode of usage is standard.

Yes, I should update this part to emphasize the difference.
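
The isolation property Martin describes can be sketched with a toy append-only log in plain Python. This is an illustration only, not the Kafka or Samza API: the `Topic` class and the job names are hypothetical. Each consumer tracks its own offset, so a slow or late job does not affect another job reading the same output.

```python
class Topic:
    """An append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self.messages = []
        self.offsets = {}

    def send(self, message):
        self.messages.append(message)

    def poll(self, consumer):
        # Return everything this consumer has not yet seen.
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.messages)
        return self.messages[start:]

topic = Topic()
for n in range(3):
    topic.send(n * n)  # an upstream job writes its output to the topic

fast = topic.poll("job-a")        # job-a reads promptly
topic.send(9)
fast_again = topic.poll("job-a")  # job-a sees only the new message
slow = topic.poll("job-b")        # job-b starts late and still sees everything
```

The late job receives the full output without the upstream job or the other consumer knowing it exists.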


- Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47895
-----------------------------------------------------------




Re: Review Request 23358: SAMZA-225

Posted by Yan Fang <ya...@gmail.com>.

> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 36
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line36>
> >
> >     This paragraph seems contradictory -- does Spark guarantee ordering or not? And what do you mean with "is not emphasized in the document"?
> >     
> >     My understanding is that Spark's transformation operators must be side-effect-free, so the order in which batches are processed is irrelevant. When one batch depends on the output of a previous batch (e.g. a window-based operation), Spark Streaming guarantees that the correct previous batch is used as input to the subsequent batch (which is effectively ordering, even if some of the execution may actually happen in parallel).
> >     
> >     I'm not sure about ordering of output operations (which may have side-effects).
> >     
> >     Another thing -- I believe Spark Streaming requires transformation operators to be deterministic. Is that true? If so, it would be worth mentioning, because 
> >     that may make it unsuitable for nondeterministic processing, e.g. a randomized machine learning algorithm. Samza has no such requirement.
> >     
> >     "Spark Streaming supports at-least once messaging semantics": you say below that Spark Streaming may lose messages if the receiver task fails. If this is the case, the guarantee is neither at-least-once nor at-most-once, but more like zero-or-more-times.
> 
> Yan Fang wrote:
>     When I say "is not emphasized in the document", mean that I could not find relevant documents. From my test, the messages order in one DStream seems guaranteed. But if you combine some DStreams in the process, no order is guaranteed. 
>     
>     you are right, transformation operations are side-effect-free and output operations (should) have the side-effects. And all transformation operations only happen after output operations are called (because of lazy implementation). 
>     
>     So I am a little conservative about the order of messages in Spark Streaming in case I write something wrong.
>     
>     yes for "transformation operators to be deterministic". Because you only apply the operations in a deterministic stream. Will mention that.
>     
>     they do lose data and work on that. https://issues.apache.org/jira/browse/SPARK-1730, https://issues.apache.org/jira/browse/SPARK-1647. It's a little weird in Kafka situation. Because of the consumer offset, it does not lose data but processes too many messages at the first, say , 2s, when you bring up the receiver after the failure. Maybe I should mention it as well?
> 
> Martin Kleppmann wrote:
>     "not emphasized": maybe say that in Spark, since messages are processed in batches by side-effect-free operators, the exact ordering of messages is not important in Spark.
>     
>     Good find on the data loss issues, I'd suggest linking to SPARK-1647. I don't understand the issue with Kafka. When it comes back after a failure, does it start consuming from the latest offset, or some older offset?
> 
> Yan Fang wrote:
>     In terms of Kafka, when Spark Streaming restarts, it starts from the older offset where it fails. That means, if Spark Streaming is using Kafka as the input stream, it will not lose data in a receiver/driver failure scenario. However, since there are many unprocessed messages in the Kafka ( because it does not consume any data during the failure time), it will consume all the unprocessed messages at the first interval. After that, it goes to normal situation where it consumes as the same rate as the data is coming. Now they have a patch https://issues.apache.org/jira/browse/SPARK-1341 to control the rate.
>     
>     But for sure it loses data if it's using Flume/Twitter data as the input stream.
> 
> Martin Kleppmann wrote:
>     With "older offset", do you mean the oldest offset (which might be data that is several weeks old)? Or from the last checkpoint?
>     
>     If you're using an input system that doesn't buffer unprocessed messages, like Flume, Twitter or the IRC example in hello-samza, then Samza similarly loses data on container restart. There's probably no way around that. So it looks like Samza and Spark are actually the same in that regard?

Yes. Sadly, Samza and Spark are the same in this regard. If Spark Streaming is using Kafka, it does not lose data either, because Kafka does not lose data, so Spark Streaming restarts from where it left off last time. Since Samza uses Kafka by default, it seems unfair to compare one system using Kafka with the other using Flume/Twitter.
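
The restart behaviour discussed in this thread (no data loss with Kafka, but a large first batch) can be sketched as a small plain-Python simulation. It is not consumer code for either system: the function `consume_batches` and the numbers are assumptions, and no rate limiting is modelled.

```python
def consume_batches(log, committed_offset, arrivals_per_interval, intervals):
    """Return how many messages each interval consumes after a restart.

    `log` holds the messages already in the broker. Consumption resumes at
    `committed_offset`, and `arrivals_per_interval` new messages are
    appended before each later interval.
    """
    log = list(log)
    offset = committed_offset
    sizes = []
    for i in range(intervals):
        if i > 0:
            log.extend(["new"] * arrivals_per_interval)
        sizes.append(len(log) - offset)  # everything not yet consumed
        offset = len(log)
    return sizes

# Hypothetical numbers: 100 messages in the broker, 40 consumed before the
# failure, then 10 new messages arriving per interval.
sizes = consume_batches(["m"] * 100, committed_offset=40,
                        arrivals_per_interval=10, intervals=3)
```

The first interval after the restart absorbs the whole backlog of 60 messages, after which consumption falls back to the arrival rate.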


- Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47895
-----------------------------------------------------------




Re: Review Request 23358: SAMZA-225

Posted by Martin Kleppmann <ma...@kleppmann.com>.

> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 36
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line36>
> >
> >     This paragraph seems contradictory -- does Spark guarantee ordering or not? And what do you mean with "is not emphasized in the document"?
> >     
> >     My understanding is that Spark's transformation operators must be side-effect-free, so the order in which batches are processed is irrelevant. When one batch depends on the output of a previous batch (e.g. a window-based operation), Spark Streaming guarantees that the correct previous batch is used as input to the subsequent batch (which is effectively ordering, even if some of the execution may actually happen in parallel).
> >     
> >     I'm not sure about ordering of output operations (which may have side-effects).
> >     
> >     Another thing -- I believe Spark Streaming requires transformation operators to be deterministic. Is that true? If so, it would be worth mentioning, because 
> >     that may make it unsuitable for nondeterministic processing, e.g. a randomized machine learning algorithm. Samza has no such requirement.
> >     
> >     "Spark Streaming supports at-least once messaging semantics": you say below that Spark Streaming may lose messages if the receiver task fails. If this is the case, the guarantee is neither at-least-once nor at-most-once, but more like zero-or-more-times.
> 
> Yan Fang wrote:
>     When I say "is not emphasized in the document", mean that I could not find relevant documents. From my test, the messages order in one DStream seems guaranteed. But if you combine some DStreams in the process, no order is guaranteed. 
>     
>     you are right, transformation operations are side-effect-free and output operations (should) have the side-effects. And all transformation operations only happen after output operations are called (because of lazy implementation). 
>     
>     So I am a little conservative about the order of messages in Spark Streaming in case I write something wrong.
>     
>     yes for "transformation operators to be deterministic". Because you only apply the operations in a deterministic stream. Will mention that.
>     
>     they do lose data and work on that. https://issues.apache.org/jira/browse/SPARK-1730, https://issues.apache.org/jira/browse/SPARK-1647. It's a little weird in Kafka situation. Because of the consumer offset, it does not lose data but processes too many messages at the first, say , 2s, when you bring up the receiver after the failure. Maybe I should mention it as well?

"not emphasized": maybe say that in Spark, since messages are processed in batches by side-effect-free operators, the exact ordering of messages is not important in Spark.

Good find on the data loss issues, I'd suggest linking to SPARK-1647. I don't understand the issue with Kafka. When it comes back after a failure, does it start consuming from the latest offset, or some older offset?


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 42
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line42>
> >
> >     Does this state DStream provide any key-value access or other query model? If it's just a stream of records, that would imply that every time a batch of input records is processed, the stream processor also needs to consume the entire state DStream. That's fine if the state is small, but with a large amount of state (multiple GB), it would probably get very inefficient. If this is true, it would further support our "Samza is good if you have lots of state" story.
> >     
> >     Also: you don't mention anything about stream joins in this comparison. I see Spark has a join operator -- do you know what it does? Does it just take one batch from each input stream and join within those batches? Or can you do joins across a longer window, or against a table?
> >     
> >     Since joins typically involve large amounts of state, they are worth highlighting as an area where Samza may be stronger.
> >     
> >     "Everytime updateStateByKey is applied, you will get a new state DStream": presumably you get a new DStream once per batch, not for every single message within a batch?
> 
> Yan Fang wrote:
>     AFAIK, no other methods. will update when I know. It's a little interesting in Spark Streaming. Seems it only updates the state of the keys when the keys appear in this time interval. (because updateStateByKey only is called every time interval). So maybe there is not concern of "consume the entire state DStream", instead, the concern is "how can I change the previous state and other key's state". asking this in the community now.
>     
>     "join" is a little tricky. You can join two DStreams in the same time interval, meaning that, you can join two batches received from the same time interval but can not join two DStreams that have different time intervals, such as a realtime batch and a window batch.
>     
>     Yes, once per batch, not for single message. will emphasize this.
> 
> Yan Fang wrote:
>     Update:
>     
>     The following statement is wrong: "Seems it only updates the state of the keys when the keys appear in this time interval. (because updateStateByKey only is called every time interval).".
>     
>     You were right. Every time, the stream processor needs to consume the entire state DStream. Spark Streaming has the inefficiency when the state is very big. And they are working on this: https://issues.apache.org/jira/browse/SPARK-2365  . Will mention this in the updated version of doc.

Ok, good. I'd suggest explicitly mentioning the join limitation, as well.


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 50
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line50>
> >
> >     "send them to executors" -> "sending them to executors"
> >     
> >     "the parallelism is simply accomplished by normal RDD operations, such as map, reduceByKey, reduceByWindow" -- this is unclear. How are these operations parallelized? I assume each RDD batch is processed sequentially by a single processing task, but different batches can be processed on different machines. Is that right?
> >     
> >     If your input stream is partitioned (i.e. there are multiple receivers), does each batch include only messages from a single partition, or are all the partitions combined into a single batch? Can you repartition a stream, e.g. when grouping on a field, to ensure that all messages with the same value in that field get grouped together? (akin to the shuffle phase in MapReduce)
> >     
> >     How does partitioning play into the ordering guarantees? Even if each partition is ordered, there typically isn't a deterministic ordering of messages across partitions; how does this interact with Spark's determinism requirement for operators?
> 
> Yan Fang wrote:
>     "How are these operations parallelized? I assume each RDD batch is processed sequentially by a single processing task, but different batches can be processed on different machines. Is that right?"
>       Not quite sure. Will update when I know.
>     
>     "If your input stream is partitioned (i.e. there are multiple receivers), does each batch include only messages from a single partition, or are all the partitions combined into a single batch? Can you repartition a stream, e.g. when grouping on a field, to ensure that all messages with the same value in that field get grouped together? (akin to the shuffle phase in MapReduce)"
>     
>     Actually, when you partition the stream, every partition is a DStream, and whatever you do is based on that DStream. If you want a "whole" stream containing all the partitions, you use "union" to combine them into one DStream. "Group by" is then more or less provided by "reduceByKey". Spark in general (the other Spark projects) has groupByKey, but it is not supported in Spark Streaming.
>     
>     Yes, you are right: there is no deterministic ordering of messages across partitions. I'm not quite sure about "Spark's determinism requirement for operators". What does this mean?

"Spark's determinism requirement for operators": I mean that a Spark operator is required to always produce the same output when given the same input. Question is: if an operator is given the same set of input messages in a different order, is it still required to produce the same output? If yes, and if the order of input messages to an operator is not guaranteed, then that puts a limitation on what kinds of operator you can implement.
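
A small Python illustration of the ordering question (plain Python, nothing Spark-specific; the two operators are hypothetical examples I made up): an operator that is insensitive to input order gives the same output for a reordered batch, while an order-sensitive one does not.

```python
def total(messages):
    """Order-insensitive operator: the sum does not depend on arrival order."""
    return sum(messages)


def last_seen(messages):
    """Order-sensitive operator: the result depends on which message came last."""
    return messages[-1]


batch = [3, 1, 4, 1, 5]
reordered = list(reversed(batch))  # same set of messages, different order

print(total(batch) == total(reordered))          # prints: True
print(last_seen(batch) == last_seen(reordered))  # prints: False
```

If the framework requires identical output for any ordering of the same batch, only operators of the first kind are safe to write.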

What I'm getting at here: if you want to write a custom Spark operator, it seems that you have to obey a lot of constraints in order to satisfy the framework's assumptions. IMHO Samza gives you a lot more freedom to implement the logic that you want.


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 66
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line66>
> >
> >     "reprocess the same data from since data was processed" -- sentence seems to have too many words in it?
> 
> Yan Fang wrote:
>     I think you know what I mean here...Just not sure how to phrase it less verbosely...*_*

I know what you mean, it's just copyediting to make it easy for readers to understand. Here's a suggestion:

"When a Samza job recovers from a failure, it's possible that it will process some data more than once. This happens because the job restarts at the last checkpoint, and any messages that had been processed between that checkpoint and the failure are processed again. The amount of reprocessed data can be minimized by setting a small checkpoint interval."
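
If it helps to see the effect of the checkpoint interval, here is a toy Python simulation. It is a sketch under simplifying assumptions, not Samza code: offsets are plain integers, a checkpoint is recorded after every `checkpoint_interval` messages, and the job crashes exactly once, just before processing offset `fail_at`.

```python
def run(num_messages, checkpoint_interval, fail_at):
    """Return every offset processed, including repeats after the restart.

    Assumes 0 <= fail_at < num_messages, i.e. the crash really happens.
    """
    processed = []
    checkpoint = 0  # offset to resume from after a restart

    # First attempt: runs until the crash.
    for offset in range(num_messages):
        if offset == fail_at:
            break
        processed.append(offset)
        if (offset + 1) % checkpoint_interval == 0:
            checkpoint = offset + 1

    # Restart: resume from the last checkpoint and run to the end.
    for offset in range(checkpoint, num_messages):
        processed.append(offset)
    return processed


def duplicates(processed):
    """Number of messages that were processed more than once."""
    return len(processed) - len(set(processed))


print(duplicates(run(10, checkpoint_interval=4, fail_at=7)))  # prints: 3
print(duplicates(run(10, checkpoint_interval=2, fail_at=7)))  # prints: 1
```

Every offset is processed at least once in both runs; the smaller interval just shrinks the replayed range.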


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47895
-----------------------------------------------------------


On July 15, 2014, 6:15 p.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23358/
> -----------------------------------------------------------
> 
> (Updated July 15, 2014, 6:15 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Comparison of Spark Streaming and Samza
> 
> 
> Diffs
> -----
> 
>   docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
>   docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
>   docs/learn/documentation/0.7.0/index.html 149ff2b 
> 
> Diff: https://reviews.apache.org/r/23358/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>


Re: Review Request 23358: SAMZA-225

Posted by Yan Fang <ya...@gmail.com>.

> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 66
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line66>
> >
> >     "reprocess the same data from since data was processed" -- sentence seems to have too many words in it?

I think you know what I mean here...Just not sure how to phrase it less verbosely...*_*


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 64
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line64>
> >
> >     "a small difference": if Spark may lose messages and Samza guarantees delivery, I'd say that's a pretty big difference ;-)
> >     
> >     In fact, it's potentially a serious problem for Spark. Do you have a reference for this? Seems like the kind of problem which could be fixed, so perhaps they're working on it.

Yes, unfortunately they do. In https://spark.apache.org/docs/latest/streaming-programming-guide.html, search for "then a tiny bit of data may be lost". See also https://issues.apache.org/jira/browse/SPARK-1730 and https://issues.apache.org/jira/browse/SPARK-1647.


- Yan




Re: Review Request 23358: SAMZA-225

Posted by Martin Kleppmann <ma...@kleppmann.com>.

> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 36
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line36>
> >
> >     This paragraph seems contradictory -- does Spark guarantee ordering or not? And what do you mean with "is not emphasized in the document"?
> >     
> >     My understanding is that Spark's transformation operators must be side-effect-free, so the order in which batches are processed is irrelevant. When one batch depends on the output of a previous batch (e.g. a window-based operation), Spark Streaming guarantees that the correct previous batch is used as input to the subsequent batch (which is effectively ordering, even if some of the execution may actually happen in parallel).
> >     
> >     I'm not sure about ordering of output operations (which may have side-effects).
> >     
> >     Another thing -- I believe Spark Streaming requires transformation operators to be deterministic. Is that true? If so, it would be worth mentioning, because that may make it unsuitable for nondeterministic processing, e.g. a randomized machine learning algorithm. Samza has no such requirement.
> >     
> >     "Spark Streaming supports at-least once messaging semantics": you say below that Spark Streaming may lose messages if the receiver task fails. If this is the case, the guarantee is neither at-least-once nor at-most-once, but more like zero-or-more-times.
> 
> Yan Fang wrote:
>     When I say "is not emphasized in the document", I mean that I could not find relevant documentation. From my tests, the message order within one DStream seems to be guaranteed. But if you combine several DStreams during processing, no order is guaranteed.
>     
>     You are right, transformation operations are side-effect-free and output operations (should) have the side effects. Transformation operations only run once an output operation is called (because of lazy evaluation).
>     
>     So I am a little conservative about the order of messages in Spark Streaming in case I write something wrong.
>     
>     Yes for "transformation operators to be deterministic", because you only apply the operations to a deterministic stream. I will mention that.
>     
>     They do lose data and are working on that: https://issues.apache.org/jira/browse/SPARK-1730, https://issues.apache.org/jira/browse/SPARK-1647. The Kafka situation is a little weird. Because of the consumer offset, it does not lose data, but it processes too many messages in the first, say, 2s when you bring the receiver back up after the failure. Maybe I should mention it as well?
> 
> Martin Kleppmann wrote:
>     "not emphasized": maybe say that in Spark, since messages are processed in batches by side-effect-free operators, the exact ordering of messages is not important.
>     
>     Good find on the data loss issues, I'd suggest linking to SPARK-1647. I don't understand the issue with Kafka. When it comes back after a failure, does it start consuming from the latest offset, or some older offset?
> 
> Yan Fang wrote:
>     In terms of Kafka, when Spark Streaming restarts, it resumes from the older offset at which it failed. That means that if Spark Streaming uses Kafka as the input stream, it will not lose data in a receiver/driver failure scenario. However, since many unprocessed messages accumulate in Kafka (because nothing is consumed during the downtime), it consumes all of the unprocessed messages in the first interval. After that, it returns to the normal situation, where it consumes at the same rate as the data arrives. They now have a patch, https://issues.apache.org/jira/browse/SPARK-1341, to control the rate.
>     
>     But for sure it loses data if it's using Flume/Twitter data as the input stream.

With "older offset", do you mean the oldest offset (which might be data that is several weeks old)? Or from the last checkpoint?

If you're using an input system that doesn't buffer unprocessed messages, like Flume, Twitter or the IRC example in hello-samza, then Samza similarly loses data on container restart. There's probably no way around that. So it looks like Samza and Spark are actually the same in that regard?


- Martin




Re: Review Request 23358: SAMZA-225

Posted by Yan Fang <ya...@gmail.com>.

> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 36
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line36>
> >
> >     This paragraph seems contradictory -- does Spark guarantee ordering or not? And what do you mean with "is not emphasized in the document"?
> >     
> >     My understanding is that Spark's transformation operators must be side-effect-free, so the order in which batches are processed is irrelevant. When one batch depends on the output of a previous batch (e.g. a window-based operation), Spark Streaming guarantees that the correct previous batch is used as input to the subsequent batch (which is effectively ordering, even if some of the execution may actually happen in parallel).
> >     
> >     I'm not sure about ordering of output operations (which may have side-effects).
> >     
> >     Another thing -- I believe Spark Streaming requires transformation operators to be deterministic. Is that true? If so, it would be worth mentioning, because that may make it unsuitable for nondeterministic processing, e.g. a randomized machine learning algorithm. Samza has no such requirement.
> >     
> >     "Spark Streaming supports at-least once messaging semantics": you say below that Spark Streaming may lose messages if the receiver task fails. If this is the case, the guarantee is neither at-least-once nor at-most-once, but more like zero-or-more-times.
> 
> Yan Fang wrote:
>     When I say "is not emphasized in the document", I mean that I could not find relevant documentation. From my tests, the message order within one DStream seems to be guaranteed. But if you combine several DStreams during processing, no order is guaranteed.
>     
>     You are right, transformation operations are side-effect-free and output operations (should) have the side effects. Transformation operations only run once an output operation is called (because of lazy evaluation).
>     
>     So I am a little conservative about the order of messages in Spark Streaming in case I write something wrong.
>     
>     Yes for "transformation operators to be deterministic", because you only apply the operations to a deterministic stream. I will mention that.
>     
>     They do lose data and are working on that: https://issues.apache.org/jira/browse/SPARK-1730, https://issues.apache.org/jira/browse/SPARK-1647. The Kafka situation is a little weird. Because of the consumer offset, it does not lose data, but it processes too many messages in the first, say, 2s when you bring the receiver back up after the failure. Maybe I should mention it as well?
> 
> Martin Kleppmann wrote:
>     "not emphasized": maybe say that in Spark, since messages are processed in batches by side-effect-free operators, the exact ordering of messages is not important.
>     
>     Good find on the data loss issues, I'd suggest linking to SPARK-1647. I don't understand the issue with Kafka. When it comes back after a failure, does it start consuming from the latest offset, or some older offset?

In terms of Kafka, when Spark Streaming restarts, it resumes from the older offset at which it failed. That means that if Spark Streaming uses Kafka as the input stream, it will not lose data in a receiver/driver failure scenario. However, since many unprocessed messages accumulate in Kafka (because nothing is consumed during the downtime), it consumes all of the unprocessed messages in the first interval. After that, it returns to the normal situation, where it consumes at the same rate as the data arrives. They now have a patch, https://issues.apache.org/jira/browse/SPARK-1341, to control the rate.

But for sure it loses data if it's using Flume/Twitter data as the input stream.
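
Here is a rough Python sketch of the Kafka catch-up behaviour I describe above. It is a toy model, not Spark code: `arrival_rate`, `downtime` and especially `max_rate` are names I made up, and the cap is only my assumption of how a rate limit like the one in SPARK-1341 might behave.

```python
def batch_sizes(arrival_rate, batch_interval, downtime, batches, max_rate=None):
    """Messages consumed in each batch interval after the receiver restarts.

    arrival_rate and max_rate are in messages per second; batch_interval
    and downtime are in seconds. max_rate=None means no rate limit.
    """
    backlog = arrival_rate * downtime  # messages that piled up while down
    sizes = []
    for _ in range(batches):
        backlog += arrival_rate * batch_interval  # new arrivals this interval
        if max_rate is None:
            consumed = backlog
        else:
            consumed = min(backlog, max_rate * batch_interval)
        sizes.append(consumed)
        backlog -= consumed
    return sizes


# 100 msg/s arriving, 2s batches, receiver was down for 60s.
print(batch_sizes(100, 2, 60, 4))                # prints: [6200, 200, 200, 200]
print(batch_sizes(100, 2, 60, 4, max_rate=500))  # prints: [1000, 1000, 1000, 1000]
```

Without a limit, the whole backlog lands in the first interval and later intervals return to the normal size; with a limit, the backlog is drained over several intervals instead.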


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 50
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line50>
> >
> >     "send them to executors" -> "sending them to executors"
> >     
> >     "the parallelism is simply accomplished by normal RDD operations, such as map, reduceByKey, reduceByWindow" -- this is unclear. How are these operations parallelized? I assume each RDD batch is processed sequentially by a single processing task, but different batches can be processed on different machines. Is that right?
> >     
> >     If your input stream is partitioned (i.e. there are multiple receivers), does each batch include only messages from a single partition, or are all the partitions combined into a single batch? Can you repartition a stream, e.g. when grouping on a field, to ensure that all messages with the same value in that field get grouped together? (akin to the shuffle phase in MapReduce)
> >     
> >     How does partitioning play into the ordering guarantees? Even if each partition is ordered, there typically isn't a deterministic ordering of messages across partitions; how does this interact with Spark's determinism requirement for operators?
> 
> Yan Fang wrote:
>     "How are these operations parallelized? I assume each RDD batch is processed sequentially by a single processing task, but different batches can be processed on different machines. Is that right?"
>       Not quite sure. Will update when I know.
>     
>     "If your input stream is partitioned (i.e. there are multiple receivers), does each batch include only messages from a single partition, or are all the partitions combined into a single batch? Can you repartition a stream, e.g. when grouping on a field, to ensure that all messages with the same value in that field get grouped together? (akin to the shuffle phase in MapReduce)"
>     
>     Actually, when you partition the stream, every partition is a DStream, and whatever you do is based on that DStream. If you want a "whole" stream containing all the partitions, you use "union" to combine them into one DStream. "Group by" is then more or less provided by "reduceByKey". Spark in general (the other Spark projects) has groupByKey, but it is not supported in Spark Streaming.
>     
>     Yes, you are right: there is no deterministic ordering of messages across partitions. I'm not quite sure about "Spark's determinism requirement for operators". What does this mean?
> 
> Martin Kleppmann wrote:
>     "Spark's determinism requirement for operators": I mean that a Spark operator is required to always produce the same output when given the same input. Question is: if an operator is given the same set of input messages in a different order, is it still required to produce the same output? If yes, and if the order of input messages to an operator is not guaranteed, then that puts a limitation on what kinds of operator you can implement.
>     
>     What I'm getting at here: if you want to write a custom Spark operator, it seems that you have to obey a lot of constraints in order to satisfy the framework's assumptions. IMHO Samza gives you a lot more freedom to implement the logic that you want.

Not sure if this answers your question. In Spark Streaming, when the same set of input messages arrives in a different order, it is treated as a different input, so the output can differ.

I agree with you that Spark has a lot of constraints and Samza is more flexible. But Spark Streaming gives users the option to reuse their batch processing code.


- Yan




Re: Review Request 23358: SAMZA-225

Posted by Martin Kleppmann <mk...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47895
-----------------------------------------------------------


Great work Yan. I know it takes a lot of work to make sense of what another framework is doing.

I've added some questions about the details below. Some might be hard to answer, but I think they're worth thinking about. If you get stuck, an option would be to commit and publish a first version of this comparison, then ask the Spark folks for their feedback, and incorporate any corrections as needed. I'll leave that to you to judge.


docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84120>

    In order for Jekyll to recognise this as a bulleted list in Markdown, you need to leave a blank line before the first bullet point. At the moment it just renders as one long paragraph with some asterisks in the middle.



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84128>

    This paragraph seems contradictory -- does Spark guarantee ordering or not? And what do you mean with "is not emphasized in the document"?
    
    My understanding is that Spark's transformation operators must be side-effect-free, so the order in which batches are processed is irrelevant. When one batch depends on the output of a previous batch (e.g. a window-based operation), Spark Streaming guarantees that the correct previous batch is used as input to the subsequent batch (which is effectively ordering, even if some of the execution may actually happen in parallel).
    
    I'm not sure about ordering of output operations (which may have side-effects).
    
    Another thing -- I believe Spark Streaming requires transformation operators to be deterministic. Is that true? If so, it would be worth mentioning, because that may make it unsuitable for nondeterministic processing, e.g. a randomized machine learning algorithm. Samza has no such requirement.
    
    "Spark Streaming supports at-least once messaging semantics": you say below that Spark Streaming may lose messages if the receiver task fails. If this is the case, the guarantee is neither at-least-once nor at-most-once, but more like zero-or-more-times.



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84138>

    Does this state DStream provide any key-value access or other query model? If it's just a stream of records, that would imply that every time a batch of input records is processed, the stream processor also needs to consume the entire state DStream. That's fine if the state is small, but with a large amount of state (multiple GB), it would probably get very inefficient. If this is true, it would further support our "Samza is good if you have lots of state" story.
    
    Also: you don't mention anything about stream joins in this comparison. I see Spark has a join operator -- do you know what it does? Does it just take one batch from each input stream and join within those batches? Or can you do joins across a longer window, or against a table?
    
    Since joins typically involve large amounts of state, they are worth highlighting as an area where Samza may be stronger.
    
    "Everytime updateStateByKey is applied, you will get a new state DStream": presumably you get a new DStream once per batch, not for every single message within a batch?



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84134>

    Typo: "intermedia"
    
    Is this periodic writing of state to HDFS basically a form of checkpointing? If so, it might be worth calling it such.
    
    The state management page of the docs also calls out that checkpointing the entire task state is inefficient if the state is large. Could do a cross-reference.



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84140>

    "its" -> "it's"
    
    "should support" -> "supports"
    
    Might be worth pointing out that in Samza you can also plug in other storage engines (../container/state-management.html#other-storage-engines), which enables great flexibility in the stream processing algorithms you can use.



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84145>

    "send them to executors" -> "sending them to executors"
    
    "the parallelism is simply accomplished by normal RDD operations, such as map, reduceByKey, reduceByWindow" -- this is unclear. How are these operations parallelized? I assume each RDD batch is processed sequentially by a single processing task, but different batches can be processed on different machines. Is that right?
    
    If your input stream is partitioned (i.e. there are multiple receivers), does each batch include only messages from a single partition, or are all the partitions combined into a single batch? Can you repartition a stream, e.g. when grouping on a field, to ensure that all messages with the same value in that field get grouped together? (akin to the shuffle phase in MapReduce)
    
    How does partitioning play into the ordering guarantees? Even if each partition is ordered, there typically isn't a deterministic ordering of messages across partitions; how does this interact with Spark's determinism requirement for operators?



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84147>

    "batch processing" -> "batch processes"



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84151>

    "a small difference": if Spark may lose messages and Samza guarantees delivery, I'd say that's a pretty big difference ;-)
    
    In fact, it's potentially a serious problem for Spark. Do you have a reference for this? Seems like the kind of problem which could be fixed, so perhaps they're working on it.



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84152>

    "reprocess the same data from since data was processed" -- sentence seems to have too many words in it?



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84173>

    Is this really true? I get the impression that with Spark Streaming you build an entire processing graph with a DSL API, and deploy that entire graph as one unit. The communication between the nodes in that graph (in the form of DStreams) is provided by the framework. In that way it looks similar to Storm.
    
    Samza is totally different -- each job is just a message-at-a-time processor, and there is no framework support for topologies. Output of a processing task always needs to go back to a message broker (e.g. Kafka).
    
    A positive consequence of Samza's design is that a job's output can be consumed by multiple unrelated jobs, potentially run by different teams, and those jobs are isolated from each other through Kafka's buffering. That is not the case with Storm's (and Spark Streaming's?) framework-internal streams.
    
    Although a Storm/Spark job could in principle write its output to a message broker, the framework doesn't really make this easy. It seems that Storm/Spark aren't intended to be used in a way where one topology's output is another topology's input. By contrast, in Samza, that mode of usage is standard.



docs/learn/documentation/0.7.0/comparisons/spark-streaming.md
<https://reviews.apache.org/r/23358/#comment84174>

    Prefer "young" to "immature" perhaps? We shouldn't sell ourselves too short :)


- Martin Kleppmann




Re: Review Request 23358: SAMZA-225

Posted by Yan Fang <ya...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/
-----------------------------------------------------------

(Updated July 15, 2014, 6:15 p.m.)


Review request for samza.


Changes
-------

* uploaded the latest patch.
* made changes according to Martin's comments.


Repository: samza


Description
-------

Comparison of Spark Streaming and Samza


Diffs (updated)
-----

  docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
  docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
  docs/learn/documentation/0.7.0/index.html 149ff2b 

Diff: https://reviews.apache.org/r/23358/diff/


Testing
-------


Thanks,

Yan Fang


Re: Review Request 23358: SAMZA-225

Posted by Yan Fang <ya...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/
-----------------------------------------------------------

(Updated July 11, 2014, 8:37 a.m.)


Review request for samza.


Changes
-------

Updated content based on Chris' comment in JIRA.

1. added one section at beginning "overview of Spark Streaming" for giving a brief introduction.

2. added more explanation for Spark Streaming in "State Management"

3. modified the illustration of Spark Streaming in "Partitioning and Parallelism" and "fault-tolerance" sections.


Repository: samza


Description
-------

Comparison of Spark Streaming and Samza


Diffs (updated)
-----

  docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
  docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
  docs/learn/documentation/0.7.0/index.html 149ff2b 

Diff: https://reviews.apache.org/r/23358/diff/


Testing
-------


Thanks,

Yan Fang