Posted to dev@samza.apache.org by Bart De Vylder <ba...@gmail.com> on 2015/12/07 12:15:39 UTC

table-stream join

Hi all,

I'm rather new to Samza and am trying some things out using Kafka as the
message broker. One use case I'm interested in, which is mentioned in the
documentation, is creating a table-stream join using bootstrap streams.

I'm interested in some recommendations/thoughts concerning the changelog
and database possibly going out of sync.

Suppose I have my database push a changelog to Kafka for every
insert/update/delete and then have a Samza job consume this stream as a
bootstrap (+ maybe some other datastream).
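To make the setup concrete, here is a minimal sketch of the join logic only, not of the Samza API. It assumes a hypothetical message format in which each changelog message is a (key, row) pair and a row of None means "delete" (the same convention Kafka log compaction uses for tombstones). The names apply_change, join and user_id are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory stand-in for the job's local copy of the table.
Table = Dict[str, dict]


def apply_change(table: Table, key: str, row: Optional[dict]) -> None:
    """Apply one changelog message; a None row is treated as a delete."""
    if row is None:
        table.pop(key, None)
    else:
        table[key] = row


def join(table: Table, event: dict, key_field: str = "user_id") -> Tuple[dict, Optional[dict]]:
    """Pair a stream event with the matching table row, or None if absent."""
    return event, table.get(event[key_field])


table: Table = {}

# Bootstrap phase: drain the whole changelog before touching the other stream.
changelog = [
    ("u1", {"name": "Ann"}),     # insert
    ("u2", {"name": "Bob"}),     # insert
    ("u1", {"name": "Ann B."}),  # update
    ("u2", None),                # delete
]
for key, row in changelog:
    apply_change(table, key, row)

# Steady state: join each incoming event against the local table.
events = [{"user_id": "u1", "page": "/a"}, {"user_id": "u2", "page": "/b"}]
joined = [join(table, event) for event in events]
```

After the bootstrap phase the table holds only the latest row for u1, so the event for u2 joins against nothing. A single lost changelog message would leave the table wrong in exactly the same silent way.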

The only information about the database this job will ever see comes from
reading the Kafka stream containing the changelog (maybe compacted by Kafka
based on key). So losing any of these changelog messages is not an option,
as this job's view of the database would then be wrong forever. Does this
imply that Kafka needs to be forced to fsync every new message for this
changelog topic? Or would it be better to still provide a complete
recreation of the changelog stream based on the current contents of the
database in case of disaster (all Kafka nodes losing power at the same
time)? Or would it be better to recreate the database based on the
changelog (still some data loss, but at least the database and the changelog
are in sync)?
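As a rough illustration of the second option (republishing from the current database contents), the sketch below computes the corrective changelog messages needed to bring an out-of-date view back in line. It is only a sketch under assumptions: both the database snapshot and the view are modelled as plain dicts, reconcile is a hypothetical helper, and None again stands for a delete.

```python
from typing import Dict, List, Optional, Tuple


def reconcile(db_rows: Dict[str, dict], view: Dict[str, dict]) -> List[Tuple[str, Optional[dict]]]:
    """Return the changelog messages that would make `view` match `db_rows`."""
    fixes: List[Tuple[str, Optional[dict]]] = []
    # Re-emit every row that is missing from the view or differs from it.
    for key, row in db_rows.items():
        if view.get(key) != row:
            fixes.append((key, row))
    # Emit a delete for every key the view still holds but the database lost.
    for key in view:
        if key not in db_rows:
            fixes.append((key, None))
    return fixes


db_rows = {"a": {"v": 1}, "b": {"v": 2}}
view = {"a": {"v": 1}, "b": {"v": 3}, "c": {"v": 4}}
fixes = reconcile(db_rows, view)
```

Here fixes contains an update for b and a delete for c. Simply re-emitting every database row would repair b but would leave the stale c in place, which is why the deletes matter.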

Any thoughts/experiences/references are much appreciated.
Regards,
Bart


-- 
Bart De Vylder
+32(0)496/558065
bartdevylder@gmail.com

Re: table-stream join

Posted by Yi Pan <ni...@gmail.com>.
Hi, Bart,

Your question is more like "is Kafka reliable against failures?" As for the
reliability of the changelog, Samza is designed to be as reliable as the
underlying messaging layer. In the case of Kafka, there are configurations
(on the topic and in the Kafka producer) that users can tune to guard
against data loss. One example from the Kafka documentation:
min.insync.replicas and request.required.acks allow you to enforce greater
durability guarantees. A typical scenario would be to create a topic with a
replication factor of 3, set min.insync.replicas to 2, and produce with
request.required.acks of -1. This will ensure that the producer raises an
exception if a majority of replicas do not receive a write. Of course,
depending on the failure model, you may still find that these guarantees
are not enough to cover, for example, a crash of the whole cluster. But
this is the typical tradeoff between performance and reliability in
configuration (i.e. the more replicas and acks you configure, the less
write throughput you may see).

More detailed configuration options can be found at
http://kafka.apache.org/documentation.html#configuration.
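As a sketch, the scenario above can be written down as plain settings plus the arithmetic behind it. Two assumptions: the producer key is spelled acks here, which is the name used by the newer Java producer (request.required.acks belongs to the older producer), and the two helper functions are hypothetical, added only to show what the numbers buy you.

```python
# Settings from the scenario above. Nothing here talks to a real cluster.
REPLICATION_FACTOR = 3
topic_config = {"min.insync.replicas": 2}
producer_config = {"acks": "all"}  # "all" is equivalent to -1

# Assumption: forcing an fsync per message, as asked in the original question,
# would be the topic-level setting flush.messages=1, at a throughput cost.
# topic_config["flush.messages"] = 1


def write_availability_tolerance(replication_factor: int, min_insync: int) -> int:
    """Replicas that can be down while acks=all writes are still accepted."""
    return replication_factor - min_insync


def acked_write_loss_tolerance(min_insync: int) -> int:
    """Replica copies that can be lost while an acknowledged write still has one copy."""
    return min_insync - 1


down_ok = write_availability_tolerance(REPLICATION_FACTOR, topic_config["min.insync.replicas"])
lost_ok = acked_write_loss_tolerance(topic_config["min.insync.replicas"])
```

With these numbers the topic keeps accepting writes with one replica down, and an acknowledged write still has a copy after losing one replica. Neither figure covers every broker losing power at once, which is the whole-cluster case mentioned above.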

Cheers,

-Yi
