Posted to dev@samza.apache.org by Stanislav Los <st...@magnetic.com> on 2016/01/14 16:28:33 UTC

Samza Join example

If anyone is interested, I did a quick PoC that joins two data sets,
using hello-samza as a starting point.

A few points to note. I did it in Scala.
Our target was to keep at least a 1-hour window of recent data at any given
point in time, i.e. ~200,000,000 records/h of throughput for the first data set
(ad auction bids) and ~20,000,000/h for the other data set (ad
impressions). That way, we're not constrained by the order of events as much,
and the data streams can be quite out of sync in case of a replay from archive
storage.
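
To make the windowed-join idea above concrete, here is a minimal sketch in plain Python. It is not the PoC's Scala code and does not use the Samza API. The class name, the in-memory buffers, and the one-bid-to-one-impression matching are all assumptions made for illustration. Each side is buffered by key until its counterpart arrives, in either order, and anything older than the window is dropped.

```python
from collections import OrderedDict

WINDOW_SECONDS = 3600  # the 1-hour retention target mentioned above


class WindowedJoin:
    """Hypothetical two-sided join: buffer each record until its match arrives."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window = window_seconds
        # Assumed in-memory buffers, key -> (arrival_time, record).
        # A real job would use a persistent key-value store instead.
        self.bids = OrderedDict()
        self.impressions = OrderedDict()

    @staticmethod
    def _buffer(buf, key, record, now):
        buf[key] = (now, record)
        buf.move_to_end(key)  # keep the buffer ordered by arrival time

    def on_bid(self, key, bid, now):
        waiting = self.impressions.pop(key, None)
        if waiting is not None:
            return bid, waiting[1]  # impression arrived first
        self._buffer(self.bids, key, bid, now)
        return None

    def on_impression(self, key, impression, now):
        waiting = self.bids.pop(key, None)
        if waiting is not None:
            return waiting[1], impression  # bid arrived first
        self._buffer(self.impressions, key, impression, now)
        return None

    def expire(self, now):
        """Drop records buffered longer than the window; return the count.

        Assumes `now` never decreases between calls.
        """
        dropped = 0
        for buf in (self.bids, self.impressions):
            while buf:
                key, (ts, _) = next(iter(buf.items()))
                if now - ts <= self.window:
                    break  # oldest entry is still inside the window
                del buf[key]
                dropped += 1
        return dropped
```

Because either side can arrive first, the join result does not depend on event order as long as both records land within the same window.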

You can find the PoC, which runs on a local Samza grid, here:
https://github.com/staslos/samza-hello-samza/tree/imp_bid_join. There is also a pull
request, not meant for merging but just to keep the changes in one place:
https://github.com/apache/samza-hello-samza/pull/6. I can't brush it up for
a proper merge with master, since I'm being pulled onto another task, but at
least it's not lost and someone may find it useful.
See src/main/scala/README for details.

I have another branch that runs on CDH at scale, but I think it's overkill
for the current topic. Anyway, if you don't mind Magnetic-specific stuff (no
legal obligations), it's here:
https://github.com/staslos/samza-hello-samza/tree/imp_bid_join_cdh

Overall, we were very impressed with Samza's performance: it took just 30
containers (30 partitions on each Kafka topic) with default settings to do
a reliable join on our Hadoop cluster. Just for the record, on Spark
Streaming I was able to keep only a couple of minutes of the bids window, with lots
of other constraints and workarounds.
Samza is our way to go for large real-time joins.
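
For a rough sense of scale, the figures above can be turned into per-partition rates. The sketch below also shows why having the same partition count on both topics matters: if both streams are keyed the same way, matching records land on the same partition number. The CRC32 hash is only an illustrative stand-in, not Kafka's actual partitioner, and the function names are hypothetical.

```python
import zlib

PARTITIONS = 30  # partitions per Kafka topic, as stated above


def partition_for(key: str, partitions: int = PARTITIONS) -> int:
    """Map a join key to a partition (illustrative hash, not Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def per_partition_rate(records_per_hour: float, partitions: int = PARTITIONS) -> float:
    """Average records per second that a single partition has to absorb."""
    return records_per_hour / 3600 / partitions


bids_rate = per_partition_rate(200_000_000)        # roughly 1,852 records/s
impressions_rate = per_partition_rate(20_000_000)  # roughly 185 records/s
```

These are averages only; they say nothing about key skew or bursts during a replay.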

Regards,
Stan

Re: Samza Join example

Posted by Yi Pan <ni...@gmail.com>.

Hi, Stanislav,

That's awesome! It would be great to have this integrated with the Samza
tutorial. Would you mind creating a tutorial page for the join job
implementation in Samza?

Thanks a lot!

-Yi
