Posted to dev@storm.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2014/11/17 01:56:34 UTC

[jira] [Commented] (STORM-151) Support multilang in Trident

    [ https://issues.apache.org/jira/browse/STORM-151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214148#comment-14214148 ] 

Jungtaek Lim commented on STORM-151:
------------------------------------

I'm interested in this issue. Is it still valid?

Btw, I read the description of this issue, but I don't understand what Nathan said.
(Currently I can only understand colinsurprenant's first approach, which could be valid if we set performance aside.)

Could you share some sketches if you don't mind?
Thanks in advance!

> Support multilang in Trident
> ----------------------------
>
>                 Key: STORM-151
>                 URL: https://issues.apache.org/jira/browse/STORM-151
>             Project: Apache Storm
>          Issue Type: Improvement
>            Reporter: James Xu
>
> https://github.com/nathanmarz/storm/issues/353
> Since it's impractical to have a separate subprocess for every operation packed into a bolt, I think the best tradeoff is to allow multilang operations that will just be their own bolt. This is a low-level exposure of the API, and the multilang program will have to make sure to emit the batch id as the first field. Multilang operations would be treated similarly to aggregators in that the tuples they produce are brand new with brand new fields.
> ----------
> colinsurprenant: I am trying to wrap my head around this issue. Would it make sense to expose classes like ShellFunction, ShellAggregator, ShellFilter, etc?
> A topology definition could be something like:
> topology.newDRPCStream("words")
>       .each(new Fields("args"), new ShellFunction("ruby", "split.rb"), new Fields("word"))
>       .groupBy(new Fields("word"))
>       .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
>       .each(new Fields("count"), new ShellFilter("ruby", "somefilter.rb"))
>       .aggregate(new Fields("count"), new ShellCombinerAggregator("ruby", "someaggregator.rb"), new Fields("sum"));
> Then the builder would assign a new shell bolt for each of these shell operations.
> ----------
> nathanmarz: That would be one way to go about it – though that will have a lot of overhead with serializing back and forth between Java-land and multilang-land. This is especially true if you have many operations running in the same bolt (like 5 eaches in a row). Another way to go about it would be to expose a lower-level facility, where the shell process is a regular bolt and has full control over the input and output tuples. Whatever fields it emits will be the fields available in the output tuples. Additionally, one of Trident's constraints is that the first field of output tuples contains the "batch id".
> ----------
> colinsurprenant: Right. My proposition is basically what you described as "impractical to have a separate subprocess for every operation" in the issue description.
> So what you are suggesting is to just expose a ShellBolt, which would basically respect the same semantics as the general Aggregator operation, in that it can emit any number of tuples with any number of fields for a given input batch.
> ----------
> joshbronson: I'm probably missing something. But would exposing an interface for an aggregator and using it carefully to avoid spawning too many processes accomplish something similar, without having to worry about batch ids?
> We've run up against this problem recently and come up with a couple of ways to get around the many-processes issue. We've considered spawning a server that would listen on a standard TCP port range or Unix sockets. Is there a place to hook in, other than the prepare method, to launch such a beast? Other functions would presumably use their prepare method to discover and connect to the server, and it would be nice to avoid races or locks.
> Also, we're working on an adapter for "BaseFunction" at the moment, though it's not quite ready for public use, and I wonder if anything will go wrong if a function continues to emit tuples to its collector after its execute method has returned. We're operating under the assumption that something will, so we've required the shell command to tell us when it's done responding to the input we just gave it. Currently we do that by requiring it to emit a JSON array. If we go with the server approach, we may also require the script to handle a tab-delimited ID telling the server to delegate to the appropriate handler for a function. Presumably this would require a thin library in the languages used to script Storm. There seems to be an appetite here for open-sourcing the components required to do this in Ruby; we definitely want to make this easy for folks.
> ----------
> quintona: Hi guys, I need the multilang support. From what I gather, the idea would be to hook the shell bolt into the Trident topology, and have it ignore and handle the batch id field?
> If that understanding is correct, then it raises the interesting question around support for any bolt within a Trident topology.
> ----------
> nathanmarz: Well, the right way to go is to essentially expose the Aggregator interface via multilang. This is general enough to encompass functions, filters, and aggregation. We could also allow users to mark multilang aggregators as committers so that they can achieve exactly-once semantics when interacting with external state.
> ----------
> quintona: I wasn't intending to allow the updating of external state via a multilang function. Surely that is a multilang state discussion, which is fundamentally different? Or am I missing something?
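
As a rough illustration of the claim above that an Aggregator-style interface is "general enough to encompass functions, filters, and aggregation", here is a minimal plain-Java sketch. It does not use Storm or a multilang subprocess. BatchAggregator, Collector, runBatch and the three example operations are hypothetical names invented for this sketch; only the init/aggregate/complete shape is borrowed from Trident's Aggregator. Prepending the batch id inside runBatch is likewise an assumption, made to show the "batch id as the first field" constraint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AggregatorSketch {

    /** Hypothetical stand-in for a collector: receives emitted tuples. */
    interface Collector {
        void emit(List<Object> values);
    }

    /** Hypothetical interface with the init/aggregate/complete shape of an aggregator. */
    interface BatchAggregator<S> {
        S init(Object batchId, Collector collector);
        void aggregate(S state, List<Object> tuple, Collector collector);
        void complete(S state, Collector collector);
    }

    /** A "function": stateless, emits zero or more tuples per input tuple. */
    static BatchAggregator<Void> splitWords() {
        return new BatchAggregator<Void>() {
            public Void init(Object batchId, Collector c) { return null; }
            public void aggregate(Void s, List<Object> tuple, Collector c) {
                for (String word : ((String) tuple.get(0)).split(" ")) {
                    c.emit(Arrays.<Object>asList(word));
                }
            }
            public void complete(Void s, Collector c) { }
        };
    }

    /** A "filter": stateless, re-emits the input tuple or drops it. */
    static BatchAggregator<Void> longerThan(final int minLength) {
        return new BatchAggregator<Void>() {
            public Void init(Object batchId, Collector c) { return null; }
            public void aggregate(Void s, List<Object> tuple, Collector c) {
                if (((String) tuple.get(0)).length() > minLength) {
                    c.emit(tuple);
                }
            }
            public void complete(Void s, Collector c) { }
        };
    }

    /** An "aggregation": keeps per-batch state and emits once, in complete(). */
    static BatchAggregator<long[]> count() {
        return new BatchAggregator<long[]>() {
            public long[] init(Object batchId, Collector c) { return new long[] {0L}; }
            public void aggregate(long[] s, List<Object> tuple, Collector c) { s[0]++; }
            public void complete(long[] s, Collector c) {
                c.emit(Arrays.<Object>asList(s[0]));
            }
        };
    }

    /** Drives one batch through an aggregator; every output tuple gets the batch id as field 0. */
    static <S> List<List<Object>> runBatch(final Object batchId,
                                           BatchAggregator<S> agg,
                                           List<List<Object>> input) {
        final List<List<Object>> out = new ArrayList<>();
        Collector collector = values -> {
            List<Object> withId = new ArrayList<>();
            withId.add(batchId);      // assumed convention: batch id first
            withId.addAll(values);
            out.add(withId);
        };
        S state = agg.init(batchId, collector);
        for (List<Object> tuple : input) {
            agg.aggregate(state, tuple, collector);
        }
        agg.complete(state, collector);
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> batch = Arrays.asList(
                Arrays.<Object>asList("the cow"), Arrays.<Object>asList("moo"));
        System.out.println(runBatch(1, splitWords(), batch));
        System.out.println(runBatch(1, longerThan(3), batch));
        System.out.println(runBatch(1, count(), batch));
    }
}
```

In this sketch the function and the filter simply ignore init and complete, which is one way to read the remark that a single multilang aggregator hook could cover all three kinds of operation. How the calls would actually be serialized to a shell process is left open here, as it is in the discussion.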



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)