Posted to dev@samza.apache.org by Martin Kleppmann <mk...@linkedin.com> on 2014/02/01 01:10:38 UTC

Review Request 17603: SAMZA-136 Copy-editing documentation (introduction section)

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17603/
-----------------------------------------------------------

Review request for samza.


Repository: samza


Description
-------

Copy-editing


Diffs
-----

  docs/learn/documentation/0.7.0/introduction/architecture.md ff8357d 
  docs/learn/documentation/0.7.0/introduction/background.md 52d8e41 
  docs/learn/documentation/0.7.0/introduction/concepts.md 2736bf0 

Diff: https://reviews.apache.org/r/17603/diff/


Testing
-------


Thanks,

Martin Kleppmann


Re: Review Request 17603: SAMZA-136 Editing documentation (introduction and comparisons sections)

Posted by Chris Riccomini <cr...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17603/#review34129
-----------------------------------------------------------

Ship it!


+1 LGTM

- Chris Riccomini


On Feb. 9, 2014, 12:57 a.m., Martin Kleppmann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/17603/
> -----------------------------------------------------------
> 
> (Updated Feb. 9, 2014, 12:57 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Updates from review
> 
> 
> Edit comparisons with MUPD8 and Storm
> 
> 
> More editing, including rewrite of state section (in comparisons/introduction)
> 
> 
> Replace use of 'member' with 'user'
> 
> 
> Copy-editing documentation (introduction section)
> 
> 
> Diffs
> -----
> 
>   docs/img/0.7.0/learn/documentation/introduction/dag.png bda85b2244df5f65f5472d557900fa2a65ea55c9 
>   docs/img/0.7.0/learn/documentation/introduction/group-by-example.png 1acd355c4565ee484540897c9c1712ae0c03d185 
>   docs/index.md 976faf2838f81de7785b766e0de2e820ae0b5c76 
>   docs/learn/documentation/0.7.0/api/overview.md b2324a411e8929c03971fd64a94699e8f6ded809 
>   docs/learn/documentation/0.7.0/comparisons/introduction.md b70697ba51604b6d6b1c49e4e8ff0376d5d92ec1 
>   docs/learn/documentation/0.7.0/comparisons/mupd8.md bb0d5a11691ae80725e51b799ab56d65edcb36db 
>   docs/learn/documentation/0.7.0/comparisons/storm.md b87c2077db2527041d8ed0397e2720772862dc60 
>   docs/learn/documentation/0.7.0/container/task-runner.md 27dab79f76a34385db5e6bebec42dd0964cbb878 
>   docs/learn/documentation/0.7.0/container/windowing.md 6058707e7d51986e8e36770303835673956a50b6 
>   docs/learn/documentation/0.7.0/introduction/architecture.md ff8357dd0397156aebdc9fa30964b18c7a71c376 
>   docs/learn/documentation/0.7.0/introduction/background.md 52d8e41cccbeb5851578c95dd0edca24f2b8471f 
>   docs/learn/documentation/0.7.0/introduction/concepts.md 2736bf0985c78d0314ed2011dc768cbbc5453f49 
> 
> Diff: https://reviews.apache.org/r/17603/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Martin Kleppmann
> 
>


Re: Review Request 17603: SAMZA-136 Editing documentation (introduction and comparisons sections)

Posted by Martin Kleppmann <mk...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17603/
-----------------------------------------------------------

(Updated Feb. 9, 2014, 12:57 a.m.)


Review request for samza.


Changes
-------

Fix things brought up in review.


Repository: samza


Description (updated)
-------

Updates from review


Edit comparisons with MUPD8 and Storm


More editing, including rewrite of state section (in comparisons/introduction)


Replace use of 'member' with 'user'


Copy-editing documentation (introduction section)


Diffs (updated)
-----

  docs/img/0.7.0/learn/documentation/introduction/dag.png bda85b2244df5f65f5472d557900fa2a65ea55c9 
  docs/img/0.7.0/learn/documentation/introduction/group-by-example.png 1acd355c4565ee484540897c9c1712ae0c03d185 
  docs/index.md 976faf2838f81de7785b766e0de2e820ae0b5c76 
  docs/learn/documentation/0.7.0/api/overview.md b2324a411e8929c03971fd64a94699e8f6ded809 
  docs/learn/documentation/0.7.0/comparisons/introduction.md b70697ba51604b6d6b1c49e4e8ff0376d5d92ec1 
  docs/learn/documentation/0.7.0/comparisons/mupd8.md bb0d5a11691ae80725e51b799ab56d65edcb36db 
  docs/learn/documentation/0.7.0/comparisons/storm.md b87c2077db2527041d8ed0397e2720772862dc60 
  docs/learn/documentation/0.7.0/container/task-runner.md 27dab79f76a34385db5e6bebec42dd0964cbb878 
  docs/learn/documentation/0.7.0/container/windowing.md 6058707e7d51986e8e36770303835673956a50b6 
  docs/learn/documentation/0.7.0/introduction/architecture.md ff8357dd0397156aebdc9fa30964b18c7a71c376 
  docs/learn/documentation/0.7.0/introduction/background.md 52d8e41cccbeb5851578c95dd0edca24f2b8471f 
  docs/learn/documentation/0.7.0/introduction/concepts.md 2736bf0985c78d0314ed2011dc768cbbc5453f49 

Diff: https://reviews.apache.org/r/17603/diff/


Testing
-------


Thanks,

Martin Kleppmann


Re: Review Request 17603: SAMZA-136 Editing documentation (introduction and comparisons sections)

Posted by Martin Kleppmann <mk...@linkedin.com>.

> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 10
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line10>
> >
> >     I think a spout is actually similar to a consumer (SystemConsumer) in Samza's parlance.
> >     
> >     In Storm, a spout is a thing that feeds messages from a stream into Storm's topologies. This is what a SystemConsumer does with Samza.

Good point, I'll update it to "spouts in Storm are similar to stream consumers in Samza".


> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 18
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line18>
> >
> >     Even Storm's "exactly once" messaging is somewhat misleading.
> >     
> >     First, Storm only guarantees exactly once messaging within its framework? That is, if a Kafka producer sends a message, then times out (but the message makes it to the broker before the timeout), and re-sends, Storm's spout will process both messages (duplicates). This isn't really Storm's fault, but the point is that you get duplicate messages processed by your bolts.
> >     
> >     Second, what happens in the "exactly once" case when the bolt is mutating state while processing a batch and a failure occurs? As far as I know, Storm's state management requires idempotent operations, and only occurs outside of the topology, right?
> >     
> >     It might be worth discussing this, as these are both things that Samza and Kafka are attempting to address.

Re point 1, yes: messages are actually processed at least once by bolts, but the side-effects of the processing on state (e.g. counters) are idempotent when retried, so that the value of the state looks as though messages were processed exactly once.

I'm adding a note to make clear that exactly-once in Storm does not apply to external side-effects, such as sending messages to a broker.

Re point 2, I think the state implementation only requires atomic writes for a single key (i.e. in a K-V store, either the value for a key is updated or not, but you can't get old and new value spliced together). The idempotence is ensured by metadata in the value.
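As a rough illustration of "metadata in the value", here is a minimal sketch of an idempotent counter update. It is not Storm's actual implementation or API: the names `CounterValue`, `last_batch_id` and `apply_batch` are hypothetical, the plain dict stands in for a K-V store with atomic single-key writes, and the sketch assumes batches are applied in increasing `batch_id` order.

```python
from dataclasses import dataclass


@dataclass
class CounterValue:
    count: int = 0
    last_batch_id: int = -1  # hypothetical metadata stored alongside the count


def apply_batch(store: dict, key: str, batch_id: int, increment: int) -> None:
    """Apply a batch's increment to a key at most once per batch_id."""
    current = store.get(key, CounterValue())
    if batch_id <= current.last_batch_id:
        # This batch was already applied; a retry must not change the state.
        return
    # Single-key write: count and metadata are replaced together as one value.
    store[key] = CounterValue(current.count + increment, batch_id)


store = {}
apply_batch(store, "page-views", batch_id=1, increment=3)
apply_batch(store, "page-views", batch_id=1, increment=3)  # replay after a failure
apply_batch(store, "page-views", batch_id=2, increment=2)
print(store["page-views"].count)
```

The replayed batch is processed twice, but the stored count looks as though each batch was processed exactly once. Anything the processing does outside this store, such as sending a message to a broker, is not protected by this check.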

When the idempotent producer is released, we'll have to update this doc again.


> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 40
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line40>
> >
> >     This is somewhat confusing. Samza does not hold a single job per process. You can have N processes (SamzaContainers) for a single job. This is configured with YARN jobs using yarn.container.count.
> >     
> >     Might be worth calling out that a single Storm process with 100 threads is equivalent to a Samza job with 100 containers.
> >     
> >

Yes, agree this was confusing. I thought it would be better to restructure this paragraph. Here's the new version -- is it better?

Storm's [parallelism model](https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology) is fairly similar to Samza's. Both frameworks split processing into independent *tasks* that can run in parallel. Resource allocation is independent of the number of tasks: a small job can keep all tasks in a single process on a single machine; a large job can spread the tasks over many processes on many machines.

The biggest difference is that Storm uses one thread per task by default, whereas Samza uses single-threaded processes (containers). A Samza container may contain multiple tasks, but there is only one thread that invokes each of the tasks in turn. This means each container is mapped to exactly one CPU core, which makes the resource model much simpler and reduces interference from other tasks running on the same machine. Storm's multithreaded model can make better use of excess capacity on an idle machine, at the cost of a less predictable resource model.
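The "one thread invokes each of the tasks in turn" model can be sketched roughly as follows. This is only an illustration, not Samza's container code: `CountingTask`, `run_container` and the in-memory `inbox` are hypothetical stand-ins, and a real container would read from its input streams rather than a deque.

```python
import threading
from collections import deque


class CountingTask:
    """Hypothetical task: counts the messages it is handed."""

    def __init__(self, name):
        self.name = name
        self.processed = 0
        self.thread_ids = set()  # records which threads called process()

    def process(self, message):
        self.processed += 1
        self.thread_ids.add(threading.get_ident())


def run_container(tasks, inbox):
    """Single-threaded loop: take one message at a time and invoke its task."""
    while inbox:
        task_name, message = inbox.popleft()
        tasks[task_name].process(message)


tasks = {name: CountingTask(name) for name in ("task-0", "task-1", "task-2")}
inbox = deque((f"task-{i % 3}", f"msg-{i}") for i in range(9))
run_container(tasks, inbox)

for task in tasks.values():
    print(task.name, task.processed, len(task.thread_ids))
```

Because every task is called from the same loop, tasks in one container never run concurrently with each other, which is what keeps the resource usage of a container close to one core.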


> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 68
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line68>
> >
> >     You might want to call this out in the exactly once discussion above. If you have two topologies communicating with each other, they need to send messages through an underlying system (Kafka, HDFS, Kestrel, etc). This will break exactly-once messaging.

Done.


> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 96
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line96>
> >
> >     Can't this be done in Samza by running a web service in a container, using streams to pass messages, and then having the web service container block until it receives a response message?

I think it could be done in Samza, but you'd have to do all the message routing yourself, making sure that the response is matched back to the request that generated it. Storm provides that as a built-in feature. (Storm's low-latency ZeroMQ communication perhaps also helps make this more practical, compared to several hops through Kafka?)

Personally I don't think this is a very important feature -- I reckon it's useful only for very specialised use cases, and Nathan happened to have such a use case for Twitter analytics. But I thought it was worth mentioning, as the Storm docs talk about it prominently, and it's kinda clever.

I'll add a note saying you can build DRPC yourself if you want it.
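For reference, the routing that would have to be built by hand amounts to tagging each request with an ID and matching responses on it. The sketch below shows only that matching step; it uses in-memory deques in place of real streams, and every name in it (`send_request`, `process_requests`, `collect_responses`, the message fields) is hypothetical rather than part of Samza or Storm.

```python
import uuid
from collections import deque

request_stream = deque()   # stands in for the stream carrying requests
response_stream = deque()  # stands in for the stream carrying responses
pending = {}               # request_id -> original request payload


def send_request(payload):
    """Tag the request with a unique ID so the response can be matched later."""
    request_id = str(uuid.uuid4())
    pending[request_id] = payload
    request_stream.append({"id": request_id, "payload": payload})
    return request_id


def process_requests():
    """Stand-in for the stream job: computes a result and echoes the ID back."""
    while request_stream:
        msg = request_stream.popleft()
        response_stream.append({"id": msg["id"], "result": msg["payload"].upper()})


def collect_responses():
    """Match each response to its pending request by ID."""
    results = {}
    while response_stream:
        msg = response_stream.popleft()
        if msg["id"] in pending:
            del pending[msg["id"]]
            results[msg["id"]] = msg["result"]
    return results


first = send_request("hello")
second = send_request("world")
process_requests()
results = collect_responses()
print(results[first], results[second])
```

A real web service container would also need timeouts for requests whose response never arrives, which this sketch leaves out.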


> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/introduction/background.md, line 34
> > <https://reviews.apache.org/r/17603/diff/2/?file=471147#file471147line34>
> >
> >     Can you make these changes to Samza's index (landing) page as well? These two descriptions are identical, and should ideally be kept in sync.

Ok, I've brought the landing page and this page back in sync.


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17603/#review33936
-----------------------------------------------------------


On Feb. 6, 2014, 10:58 p.m., Martin Kleppmann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/17603/
> -----------------------------------------------------------
> 
> (Updated Feb. 6, 2014, 10:58 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Copy-edited the 'introduction' and 'comparisons' sections of the documentation, to make them more fluid to read.
> 
> Changed all uses of the word 'member' (which is quite LinkedIn-specific terminology) to refer to 'user' instead.
> 
> Rewrote the explanation of state management (in comparisons/introduction) as I found it confusing.
> 
> Rewrote the page comparing Samza with Storm, because it was outdated and no longer represented Storm accurately.
> 
> 
> Diffs
> -----
> 
>   docs/img/0.7.0/learn/documentation/introduction/dag.png bda85b2244df5f65f5472d557900fa2a65ea55c9 
>   docs/img/0.7.0/learn/documentation/introduction/group-by-example.png 1acd355c4565ee484540897c9c1712ae0c03d185 
>   docs/learn/documentation/0.7.0/api/overview.md b2324a411e8929c03971fd64a94699e8f6ded809 
>   docs/learn/documentation/0.7.0/comparisons/introduction.md b70697ba51604b6d6b1c49e4e8ff0376d5d92ec1 
>   docs/learn/documentation/0.7.0/comparisons/mupd8.md bb0d5a11691ae80725e51b799ab56d65edcb36db 
>   docs/learn/documentation/0.7.0/comparisons/storm.md b87c2077db2527041d8ed0397e2720772862dc60 
>   docs/learn/documentation/0.7.0/container/task-runner.md 27dab79f76a34385db5e6bebec42dd0964cbb878 
>   docs/learn/documentation/0.7.0/container/windowing.md 6058707e7d51986e8e36770303835673956a50b6 
>   docs/learn/documentation/0.7.0/introduction/architecture.md ff8357dd0397156aebdc9fa30964b18c7a71c376 
>   docs/learn/documentation/0.7.0/introduction/background.md 52d8e41cccbeb5851578c95dd0edca24f2b8471f 
>   docs/learn/documentation/0.7.0/introduction/concepts.md 2736bf0985c78d0314ed2011dc768cbbc5453f49 
> 
> Diff: https://reviews.apache.org/r/17603/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Martin Kleppmann
> 
>


Re: Review Request 17603: SAMZA-136 Editing documentation (introduction and comparisons sections)

Posted by Chris Riccomini <cr...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17603/#review33936
-----------------------------------------------------------



docs/learn/documentation/0.7.0/comparisons/storm.md
<https://reviews.apache.org/r/17603/#comment63750>

    I think a spout is actually similar to a consumer (SystemConsumer) in Samza's parlance.
    
    In Storm, a spout is a thing that feeds messages from a stream into Storm's topologies. This is what a SystemConsumer does with Samza.



docs/learn/documentation/0.7.0/comparisons/storm.md
<https://reviews.apache.org/r/17603/#comment63751>

    Even Storm's "exactly once" messaging is somewhat misleading.
    
    First, Storm only guarantees exactly once messaging within its framework? That is, if a Kafka producer sends a message, then times out (but the message makes it to the broker before the timeout), and re-sends, Storm's spout will process both messages (duplicates). This isn't really Storm's fault, but the point is that you get duplicate messages processed by your bolts.
    
    Second, what happens in the "exactly once" case when the bolt is mutating state while processing a batch and a failure occurs? As far as I know, Storm's state management requires idempotent operations, and only occurs outside of the topology, right?
    
    It might be worth discussing this, as these are both things that Samza and Kafka are attempting to address.



docs/learn/documentation/0.7.0/comparisons/storm.md
<https://reviews.apache.org/r/17603/#comment63754>

    This is somewhat confusing. Samza does not hold a single job per process. You can have N processes (SamzaContainers) for a single job. This is configured with YARN jobs using yarn.container.count.
    
    Might be worth calling out that a single Storm process with 100 threads is equivalent to a Samza job with 100 containers.
    
    



docs/learn/documentation/0.7.0/comparisons/storm.md
<https://reviews.apache.org/r/17603/#comment63756>

    You might want to call this out in the exactly once discussion above. If you have two topologies communicating with each other, they need to send messages through an underlying system (Kafka, HDFS, Kestrel, etc). This will break exactly-once messaging.



docs/learn/documentation/0.7.0/comparisons/storm.md
<https://reviews.apache.org/r/17603/#comment63758>

    Can't this be done in Samza by running a web service in a container, using streams to pass messages, and then having the web service container block until it receives a response message?



docs/learn/documentation/0.7.0/introduction/background.md
<https://reviews.apache.org/r/17603/#comment63761>

    Can you make these changes to Samza's index (landing) page as well? These two descriptions are identical, and should ideally be kept in sync.


- Chris Riccomini




Re: Review Request 17603: SAMZA-136 Editing documentation (introduction and comparisons sections)

Posted by Martin Kleppmann <mk...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17603/
-----------------------------------------------------------

(Updated Feb. 6, 2014, 10:58 p.m.)


Review request for samza.


Changes
-------

Now edited the 'comparisons' section of the docs as well.


Summary (updated)
-----------------

SAMZA-136 Editing documentation (introduction and comparisons sections)


Repository: samza


Description (updated)
-------

Copy-edited the 'introduction' and 'comparisons' sections of the documentation, to make them more fluid to read.

Changed all uses of the word 'member' (which is quite LinkedIn-specific terminology) to refer to 'user' instead.

Rewrote the explanation of state management (in comparisons/introduction) as I found it confusing.

Rewrote the page comparing Samza with Storm, because it was outdated and no longer represented Storm accurately.


Diffs (updated)
-----

  docs/img/0.7.0/learn/documentation/introduction/dag.png bda85b2244df5f65f5472d557900fa2a65ea55c9 
  docs/img/0.7.0/learn/documentation/introduction/group-by-example.png 1acd355c4565ee484540897c9c1712ae0c03d185 
  docs/learn/documentation/0.7.0/api/overview.md b2324a411e8929c03971fd64a94699e8f6ded809 
  docs/learn/documentation/0.7.0/comparisons/introduction.md b70697ba51604b6d6b1c49e4e8ff0376d5d92ec1 
  docs/learn/documentation/0.7.0/comparisons/mupd8.md bb0d5a11691ae80725e51b799ab56d65edcb36db 
  docs/learn/documentation/0.7.0/comparisons/storm.md b87c2077db2527041d8ed0397e2720772862dc60 
  docs/learn/documentation/0.7.0/container/task-runner.md 27dab79f76a34385db5e6bebec42dd0964cbb878 
  docs/learn/documentation/0.7.0/container/windowing.md 6058707e7d51986e8e36770303835673956a50b6 
  docs/learn/documentation/0.7.0/introduction/architecture.md ff8357dd0397156aebdc9fa30964b18c7a71c376 
  docs/learn/documentation/0.7.0/introduction/background.md 52d8e41cccbeb5851578c95dd0edca24f2b8471f 
  docs/learn/documentation/0.7.0/introduction/concepts.md 2736bf0985c78d0314ed2011dc768cbbc5453f49 

Diff: https://reviews.apache.org/r/17603/diff/


Testing
-------


Thanks,

Martin Kleppmann