You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Ben Kirwin (JIRA)" <ji...@apache.org> on 2014/12/01 01:08:13 UTC

[jira] [Commented] (SAMZA-390) High-Level Language for Samza

    [ https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229292#comment-14229292 ] 

Ben Kirwin commented on SAMZA-390:
----------------------------------

I'm a bit late to the conversation, so excuse me if I jump right in.

I've been developing {{coast}}, another high-level streaming project that compiles down to Samza jobs. (https://github.com/bkirwi/coast) The implementation and 'frontend DSL' for this project is in Scala, and there's also some support code for unit testing and job-graph visualization. Unlike most of what's discussed above, this project is *not* directly SQL-inspired; by analogy to the Hadoop ecosystem, it's more of a Cascading than a Hive. (I suspect you could throw a SQLish frontend on top of it, in the same way Spark SQL does for Spark, but nobody's actually tried this yet.) {{coast}}'s exactly-once semantics are largely inspired by Kafka's log model, and many of the design choices are intended to make it a 'good citizen' in a larger Kafka-based infrastructure.

Since we're taking fairly different approaches, I'm not sure how well my conclusions translate over. I'm still trying to plow through some of the papers in this thread (thanks for these!) so for the moment, I'll leave some notes on managing time.

[~yipan] mentioned some difficulties with reconciling the framework- and application-level views of time: machines may disagree about the current time, messages may be processed arbitrarily long after they're sent, timestamps in a stream might not be monotonically increasing, and messages in different partitions might be ordered at the application level but not within Kafka / another backing store. I regard all these issues as pretty fundamental in a distributed context. I'm also pessimistic about any framework's ability to 'paper over' them; there are a bunch of viable solutions, none of which seem to work seamlessly for all problems. 

So far, like Freshet, {{coast}} has basically punted on this: the only notion of time / ordering comes from Kafka's ordering within a partition. It turns out you can get pretty far with this, but not all the way -- for example, you can window by message count but not by time. To bridge that gap in expressiveness, I've been playing with the idea of a 'clock stream', which just produces a 'tick' message every *n* seconds. It turns out that this type of stream has essentially the same semantics as a Kafka-backed stream does: every task sees the same messages in the same order, but possibly skewed slightly in time depending on the local clock; and it's possible to 'checkpoint' your position in the stream by remembering the time of the last successfully-processed 'tick'. For {{coast}}, this turns out to be very useful -- I already have a rich language for manipulating streams, so by making clock streams 'first class', it's possible to implement most (all?) of the time-based windowing strategies discussed above in library code.

In a SQL-type API, it would probably be awkward to manipulate streams of ticks directly; but it might still be useful as a implementation mechanism or the basis for a semantic model.

As noted, this is all somewhat speculative for me at the moment; I'll leave a note here once I get around to implementing this.

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are defined in this language and transformed to a dataflow graph where the nodes are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)