Posted to commits@samza.apache.org by "Chris Riccomini (JIRA)" <ji...@apache.org> on 2014/03/17 18:35:42 UTC

[jira] [Commented] (SAMZA-184) Add thin multi-language support for SamzaContainer

    [ https://issues.apache.org/jira/browse/SAMZA-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938082#comment-13938082 ] 

Chris Riccomini commented on SAMZA-184:
---------------------------------------

My initial instinct is that we should favor simple, convenient solutions over more performant ones for this implementation. If performance is a major concern, developers should drop down into the JVM. That said, I think we're going to need to be able to write non-JVM libraries (Python, Go, etc.) that can handle at least a few thousand messages per second (per container) in order for this to be at all useful.

bq. Should we start one subprocess per SamzaContainer, or one subprocess per StreamTask?

Starting one subprocess per StreamTask fits Samza's processing model much better than starting one subprocess per container. The problem with this approach is that it could start hundreds or even thousands of processes on a single node in cases where a SamzaContainer is consuming from a large number of partitions. One could argue that a single container consuming hundreds or thousands of partitions should be written on the JVM to get proper performance. In that case, we have to be up front that this implementation is about convenience and not performance.

On the flip side, if we start one subprocess per SamzaContainer, we need a way to share the subprocess's input/output transport connections between the different StreamTask instances, all of which would need to send messages to the subprocess. This could be done with a static variable, but that seems a bit hacky. If we agree on an HTTP/TCP based transport, we could use the TaskLifecycleListener (or add a ContainerLifecycleListener) to start the subprocess on container start. In this case, the StreamTasks just need to know which port to connect to in order to start sending messages to the process. This still requires getting the port information to the StreamTasks, but I think we could come up with a way to do that.
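
To make the shared-subprocess idea a bit more concrete, here is a rough sketch of a child that binds to an ephemeral TCP port and serves several connections, one per task. It is only an illustration under assumptions: Python stands in for both sides (in Samza the connecting side would be JVM StreamTask code), the newline-delimited JSON framing is a guess, and start_child_server, LineHandler, ask, and the "task"/"ack" fields are hypothetical names, not part of any Samza API.

```python
import json
import socket
import socketserver
import threading


class LineHandler(socketserver.StreamRequestHandler):
    """Handles one connection: one JSON request per line, one JSON reply per line."""

    def handle(self):
        for raw in self.rfile:
            request = json.loads(raw)
            # Assumed reply shape, purely for illustration.
            reply = {"task": request.get("task"), "ack": True}
            self.wfile.write((json.dumps(reply) + "\n").encode("utf-8"))


def start_child_server():
    """Bind to port 0 so the OS picks a free port, then report that port."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), LineHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]


def ask(port, task_name):
    """Stand-in for one StreamTask talking to the shared child process."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
        with conn.makefile("rwb") as stream:
            stream.write((json.dumps({"task": task_name}) + "\n").encode("utf-8"))
            stream.flush()
            return json.loads(stream.readline())


server, port = start_child_server()
# Two "tasks" share the one child; all they need to know is the port.
replies = [ask(port, name) for name in ("partition-0", "partition-1")]
server.shutdown()
server.server_close()
```

The only piece of state the tasks share here is the port number, which is the information the container would still have to hand to each StreamTask somehow.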

bq. How should the parent interact with the subprocess at both the transport (stdin/stdout, unix sockets, TCP, HTTP, Thrift, etc) and serialization level (protobuf, json, etc)?

It seems to me that the main trade-off here is really performance vs. convenience. On the transport side, stdin/stdout or HTTP seem the most convenient. On the serde side, JSON is clearly the most convenient, but definitely not the most performant. This decision is closely tied to the one-at-a-time vs. batching question (below).
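
For a sense of how little code the convenient end of that trade-off needs, here is a minimal sketch of a child-side loop using line-delimited JSON over a pair of streams. The message fields ("op", "message") and the handle and serve functions are assumptions made up for illustration; no protocol has been decided.

```python
import io
import json


def handle(request):
    """Hypothetical handler that dispatches on an assumed 'op' field."""
    op = request.get("op")
    if op == "process":
        # Upper-casing stands in for real task logic.
        return {"op": "send", "message": str(request.get("message", "")).upper()}
    if op in ("init", "window", "close"):
        return {"op": "ack", "for": op}
    return {"op": "error", "reason": "unknown op: %r" % (op,)}


def serve(infile, outfile):
    """Read one JSON object per line and write one JSON response per line."""
    for line in infile:
        line = line.strip()
        if not line:
            continue
        outfile.write(json.dumps(handle(json.loads(line))) + "\n")
        outfile.flush()  # flush per message so the parent is not left waiting


# In-memory streams stand in for the pipes the parent would own.
demo_in = io.StringIO('{"op": "init"}\n{"op": "process", "message": "hello"}\n')
demo_out = io.StringIO()
serve(demo_in, demo_out)
responses = [json.loads(line) for line in demo_out.getvalue().splitlines()]
```

A real child would call serve(sys.stdin, sys.stdout) instead of using the in-memory streams. Every message pays for a JSON parse, a JSON dump, and a flush, which is where the performance cost comes from.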

bq. What should the protocol look like? We should ideally support all of the operations in StreamTask, InitableTask, WindowableTask, ClosableTask, etc.

Ideally, we'd support all operations in all existing interfaces. 

bq. Should the child process receive the messages in batches, or one at a time?

The one-at-a-time approach fits Samza's processing model much better. This affects our decision about transport and serde. If we stick with one message at a time, and pay heavy transport RPC and serde costs (HTTP/JSON), this might end up being too slow for even trivial use cases.
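
A back-of-the-envelope model shows why the fixed per-call cost matters so much. The sketch below assumes each round trip costs a fixed overhead plus a per-message serde cost. The function name and the 1 ms and 50 microsecond figures are made-up placeholders, not measurements of any transport.

```python
def messages_per_second(rpc_overhead_s, per_message_s, batch_size):
    """Throughput when each round trip carries batch_size messages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    round_trip_s = rpc_overhead_s + per_message_s * batch_size
    return batch_size / round_trip_s


# Hypothetical costs: 1 ms fixed overhead per round trip, 50 us serde per message.
for batch_size in (1, 10, 100):
    rate = messages_per_second(0.001, 0.00005, batch_size)
    print(batch_size, round(rate))
```

With these placeholder numbers, one message per round trip comes out just under a thousand messages per second, while batches of ten clear several thousand. The real figures depend entirely on the transport and serde we pick, so this only illustrates the shape of the trade-off.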

> Add thin multi-language support for SamzaContainer
> --------------------------------------------------
>
>                 Key: SAMZA-184
>                 URL: https://issues.apache.org/jira/browse/SAMZA-184
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.6.0
>            Reporter: Chris Riccomini
>
> There has been some interest in supporting languages other than Java (or JVM-based languages). We have already opened up SAMZA-18, which proposes supporting a C implementation of SamzaContainer.
> A second solution to this problem is to have a StreamTask implementation that starts a child process in another language, and acts as a bridge between the child process and the java-based Samza APIs. This is the way that both Storm [1] and Hadoop work.
> A lot of design decisions need to be fleshed out to support this, but most people on the mailing list were very supportive of this approach. [2]
> Things that need to be decided:
> 1. Should we start one subprocess per SamzaContainer, or one subprocess per StreamTask?
> 2. How should the parent interact with the subprocess at both the transport (stdin/stdout, unix sockets, TCP, HTTP, Thrift, etc) and serialization level (protobuf, json, etc)?
> 3. What should the protocol look like? We should ideally support all of the operations in StreamTask, InitableTask, WindowableTask, ClosableTask, etc.
> 4. Should the child process receive the messages in batches, or one at a time?
> It'd be good to get a draft proposal up on the Wiki, so we can all discuss this and converge on an implementation.
> [1] https://github.com/nathanmarz/storm/wiki/Multilang-protocol
> [2] http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201403.mbox/%3CCAB%2B2NVXX2Fq_61WfvH%2BAfW8ZW7vQbVfTN-JPGU%2Bd7AdZ73oPDQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.2#6252)