You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Yi Pan (Data Infrastructure) (JIRA)" <ji...@apache.org> on 2015/02/03 23:04:37 UTC
[jira] [Commented] (SAMZA-390) High-Level Language for Samza

    [ https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304108#comment-14304108 ] 

Yi Pan (Data Infrastructure) commented on SAMZA-390:
----------------------------------------------------

Hi, [~jkreps], just refreshed my memory on the tuple vs time window comparison in http://cs.brown.edu/~ugur/streamsql.pdf

{quote}
Key difference is the "tuple-driven" vs "time-driven" distinction. Personally I thought tuple driven is a much closer fit to the underlying Kafka concepts (an ordered stream of tuples).
{quote}

I agree from the ordering point of view, tuple-driven is a natural fit w/ Kafka concept. Revisiting the problems presented by the paper, there are mainly the following two issues:
# in time-based window, when events happened with the same timestamp, there is no defined ordering.
# in tuple-based window, events in the same window (by number of tuples) may actually expand over long physical time and break the "simultaneity" assumption.

In Samza/Kafka world, we don't need to worry about ordering in a stream if:
# ordering in the same stream is always defined by Kafka partitioned ordering
# ordering between different input streams is always defined by a consistent MessageSelector among the input streams

Now, the only issue to be resolved is how to maintain the "simultaneity" semantics among different streams. Here are a few things to consider:
# what's the definition of "simultaneity"? Based on the time both events arrives at Samza consumer or based on the actual application timestamp? Ideally it should be later, but it requires the producer to explicitly tag the events, and also requires producers to follow the same wall-clock. Short of that, we can only inject the system timestamp at the ingress point of Samza job, at best.
# how to maintain the ordering between the streams? The paper proposed a total order of events in the input streams in a single job: a) order by timestamp of the event to maintain "simultaneity"; b) define a total order for events w/ the same timestamp from all input streams. I believe that we have a way to define the total oder for all input streams in Samza. The tricky part is to maintain the "simultaneity" semantics which requires to sort the events based on timestamp first, then the ordering provided by the Kafka system for "same time events".

Here is what I am thinking as the first step implementation:
# define "simultaneous event" by the timestamp of the ingress Samza consumer. Hence, w/ only one Samza job per query, we don't need to re-sort the received events from Kafka streams since they are already sorted based on "arrival timestamp"

In the future, the extension to application timestamp could be the following:
# Each event will be tagged by ( _app-ts_, _strm-order-no_ , _offset_ ) and all events from all input streams can be sorted in a single ordered sequence of events.
# There will be a termination condition for the last event with a timestamp that is before a certain _app_ts_ that can still be accepted by the Job (i.e. it may or may not be the heart-beat method). Upon closing this window, the ordered sequence of events from all input streams are determined and should be processed. 
## I have not come up with a good idea of "closing window condition" that can reliably work in a distributed environment yet. Some LinkedIn jobs are using timeout method to close a window, like the call-graph jobs.

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>          Components: sql
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>         Attachments: StreamSQLforSAMZA-v0.1.docx.docx
>
>
> Discussion about high-level languages to define Samza queries. Queries are defined in this language and transformed to a dataflow graph where the nodes are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)