Posted to dev@samza.apache.org by Bhavesh Mistry <mi...@gmail.com> on 2014/07/17 03:28:53 UTC

Samza Use Case

Hi Dev Team,

I attended the recent Kafka conference at LinkedIn's headquarters, and
it sounds like you are doing something similar to what we do: an
"Aggregated Call Graph across Many Web Requests" per some interval.

We have a similar transaction logging API (think of it as a graph that
stitches logs together across many servers, similar to
http://devopsdotcom.files.wordpress.com/2012/11/screen-shot-2012-11-11-at-10-06-39-am.png),
and we have a view similar to the Network tab in Chrome's inspector (a
call graph).

We use Kafka end to end as our data pipeline: Producer => Brokers =>
MirrorMaker => Brokers => (Camus job => HDFS). With our transaction
logs, we want to aggregate metrics (std, min, max, 1st to 99th
percentile, etc.) across many web requests. The speaker at the
conference talked about using Samza for this, and I just wanted to
understand what the back-end and near-real-time storage for this call
graph is, and how Samza compares to stream processors such as MUPD8
and Storm.

I would appreciate it if you could share your experience.


Thanks,

Bhavesh

Re: Samza Use Case

Posted by Chris Riccomini <cr...@linkedin.com.INVALID>.
Hey Bhavesh,

The most useful feature for your use case is going to be state management.
You'll need to store events in some state store as the events come in, if
you wish to calculate windowed metrics like 99th percentile, min, max,
etc. This usually requires a lot of space if your requests are
high-volume. Using Samza's state management feature, you can store this
data in a local LevelDB store, and calculate your aggregates periodically.
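
For illustration, here's a rough sketch of what that kind of windowed
aggregation could look like on the 0.7.0 API. The store name
("latency-store"), the "latencyMs" field, and the output topic are all
made up for the example, not our exact job:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;
  import java.util.Map;
  import org.apache.samza.config.Config;
  import org.apache.samza.storage.kv.Entry;
  import org.apache.samza.storage.kv.KeyValueIterator;
  import org.apache.samza.storage.kv.KeyValueStore;
  import org.apache.samza.system.IncomingMessageEnvelope;
  import org.apache.samza.system.OutgoingMessageEnvelope;
  import org.apache.samza.system.SystemStream;
  import org.apache.samza.task.InitableTask;
  import org.apache.samza.task.MessageCollector;
  import org.apache.samza.task.StreamTask;
  import org.apache.samza.task.TaskContext;
  import org.apache.samza.task.TaskCoordinator;
  import org.apache.samza.task.WindowableTask;

  public class LatencyAggregateTask
      implements StreamTask, WindowableTask, InitableTask {
    private KeyValueStore<String, Double> store;
    private long counter = 0;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
      // "latency-store" must be declared in the job config (stores.*).
      store = (KeyValueStore<String, Double>) context.getStore("latency-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
        MessageCollector collector, TaskCoordinator coordinator) {
      Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
      // "latencyMs" is an assumed field name. A real job would key by a
      // unique event id rather than an in-memory counter.
      Number latency = (Number) event.get("latencyMs");
      store.put(String.valueOf(counter++), latency.doubleValue());
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
      // Drain the buffered observations, then compute the aggregates.
      List<Double> values = new ArrayList<Double>();
      KeyValueIterator<String, Double> it = store.all();
      try {
        while (it.hasNext()) {
          Entry<String, Double> e = it.next();
          values.add(e.getValue());
          store.delete(e.getKey());
        }
      } finally {
        it.close();
      }
      if (values.isEmpty()) return;
      Collections.sort(values);
      double min = values.get(0);
      double max = values.get(values.size() - 1);
      // Simple nearest-rank percentile over the sorted window.
      double p99 = values.get((int) Math.floor(0.99 * (values.size() - 1)));
      collector.send(new OutgoingMessageEnvelope(
          new SystemStream("kafka", "latency-aggregates"),
          "min=" + min + " max=" + max + " p99=" + p99));
    }
  }

The window interval is controlled by task.window.ms in the job config.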

The full flow for the aggregated call graph topology looks like this:

1. Web services send log (<service-name>-log_event) and service call
(<service-name>-service_call) metrics to the Kafka brokers.
2. These topics are mirrored to a non-production colo.
3. A "repartitioner" Samza job reads all of these topics, and emits the
events back out again to two topics: all-service-calls, all-log-events.
The "repartitioner" job emits these messages partitioned by their "page
key", which is a key unique to a single page load. Thus, all messages for
a single page load are written into a single partition for both the
all-service-calls and all-log-events topics.
4. A "aggregator" job reads the all-* topics, and buffers the events for a
period of time. After a configurable period of time (I think we use a
minute or two), we group all events by their page keys, and emit one
message per page key, where the format is something like page-key => {
service-calls: [ {...} ], log-events: [ {...} ] }. All of these messages
are then sent to a downstream topic assembled-call-graphs.
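
To make (3) concrete, a bare-bones repartitioner task could look like
the sketch below. The "pageKey" field name is an assumption about the
message format; with the Kafka system producer, the envelope's key
hashes to a partition, which is what gives you the
one-page-load-per-partition property:

  import java.util.Map;
  import org.apache.samza.system.IncomingMessageEnvelope;
  import org.apache.samza.system.OutgoingMessageEnvelope;
  import org.apache.samza.system.SystemStream;
  import org.apache.samza.task.MessageCollector;
  import org.apache.samza.task.StreamTask;
  import org.apache.samza.task.TaskCoordinator;

  public class RepartitionerTask implements StreamTask {
    private static final SystemStream SERVICE_CALLS =
        new SystemStream("kafka", "all-service-calls");
    private static final SystemStream LOG_EVENTS =
        new SystemStream("kafka", "all-log-events");

    @Override
    @SuppressWarnings("unchecked")
    public void process(IncomingMessageEnvelope envelope,
        MessageCollector collector, TaskCoordinator coordinator) {
      Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
      String pageKey = (String) event.get("pageKey"); // assumed field name
      // Route by input topic name; the message itself passes through as-is.
      String topic = envelope.getSystemStreamPartition().getStream();
      SystemStream out =
          topic.endsWith("-service_call") ? SERVICE_CALLS : LOG_EVENTS;
      // The key determines the output partition, so all events for one
      // page load land in the same partition.
      collector.send(new OutgoingMessageEnvelope(out, pageKey, event));
    }
  }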

>From here, we do a number of things with the data. We have some consumers
that read the assembled-call-graphs and do alerting. We have others that
load the data into a MySQL DB, and provide a front-end UI to view
individual call graphs, or aggregated views of the call graphs (something
similar to the network tab in Chrome, like you said). We also load the
data into HDFS via Camus.

For (3), you might want to have a look at Samza's ConfigRewriter class:

  https://samza.incubator.apache.org/learn/documentation/0.7.0/api/javadocs/org/apache/samza/config/ConfigRewriter.html

Samza's input streams are defined in config (task.inputs). This is
problematic since the task.inputs for the repartitioner job will change
over time as we add new web services. The ConfigRewriter allows us to pick
up all new service call/log event topics when the job is restarted, without
having to manually edit the job's config file. Ideally, we'd like to not
even have to restart the job to pick up new streams, but we're not there
yet.
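
A rewriter doing that looks roughly like the sketch below. The topic
discovery part (fetchTopicList()) is a stand-in for however you
enumerate topics (e.g. from ZooKeeper); the filter mirrors the
<service-name>-service_call / <service-name>-log_event naming scheme
above:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.samza.config.Config;
  import org.apache.samza.config.ConfigRewriter;
  import org.apache.samza.config.MapConfig;

  public class InputTopicsRewriter implements ConfigRewriter {
    @Override
    public Config rewrite(String name, Config config) {
      // Build a fresh task.inputs list from the current topic set.
      StringBuilder inputs = new StringBuilder();
      for (String topic : fetchTopicList()) {
        if (topic.endsWith("-service_call") || topic.endsWith("-log_event")) {
          if (inputs.length() > 0) inputs.append(",");
          inputs.append("kafka.").append(topic);
        }
      }
      Map<String, String> rewritten = new HashMap<String, String>(config);
      rewritten.put("task.inputs", inputs.toString());
      return new MapConfig(rewritten);
    }

    private Iterable<String> fetchTopicList() {
      // Stand-in: query ZooKeeper/the brokers for the current topic list.
      throw new UnsupportedOperationException("topic discovery not shown");
    }
  }

The rewriter is hooked up via the job.config.rewriters and
job.config.rewriter.<name>.class settings, and runs each time the job
is (re)started.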

For (4), one useful feature that Samza provides is state management. If
your buffered events are too large for memory, you can use LevelDB, which
ships with Samza, to store your events locally with your stream process.
Details are here:

  http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
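
For reference, declaring such a store in the job config looks like this
(store name and serdes are just examples; the changelog line is
optional, but it's what makes the state recoverable after a failure):

  stores.call-graph-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
  stores.call-graph-store.key.serde=string
  stores.call-graph-store.msg.serde=json
  stores.call-graph-store.changelog=kafka.call-graph-store-changelog
  serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
  serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory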


I'll also mention that we're planning on writing a more in-depth blog post
on this use case on LinkedIn's engineering blog in the future, so keep
your eye out for that.

Cheers,
Chris
