Posted to users@kafka.apache.org by Philip Schmitt <ph...@outlook.com> on 2017/09/12 19:19:41 UTC

Reliably producing records to remote cluster: what are my options?

Hi!



We want to reliably produce events into a remote Kafka cluster in (mostly) near real-time. We have to provide an at-least-once guarantee.

Examples are a "Customer logged in" event, that will be consumed by a data warehouse for reporting (numbers should be correct) or a "Customer unsubscribed from newsletter" event, that determines whether the customer gets emails (if she unsubscribes, but the message is lost, she will not be happy).



Context:

  *   We run an ecommerce website on a cluster of up to ten servers and an Oracle database.
  *   We have a small Kafka cluster at a different site. We have in the past had a small number of network issues, where the web servers could not reach the other site for maybe an hour.
  *   We don't persist all events in the database. If the application is restarted, events that occurred before the restart cannot be sent to Kafka. The row of a customer might have a newer timestamp, but we couldn't tell which columns were changed.



Concerns:

  *   In case of, for example, a network outage between the web servers and the Kafka cluster, we may accumulate thousands of events on each web server that cannot be sent to Kafka. If a server is shut down during that time, the messages would be lost.
  *   If we produce to Kafka from within the application in addition to writing to the database, the data may become inconsistent if one of the writes fails.





The more I read about Kafka, the more options I see, but I cannot assess how well the options might work or what the trade-offs between them are.



  1.  produce records directly within the application
  2.  produce records from the Oracle database via Kafka Connect
  3.  produce records from the Oracle database via a CDC solution (GoldenGate, Attunity, Striim, others?)
  4.  persist events in log files and produce to Kafka via elastic Logstash/Filebeat
  5.  persist events in log files and produce to Kafka via a Kafka Connect source connector
  6.  persist events in a local, embedded database and produce to Kafka via an existing source connector
  7.  produce records directly within the application to a new Kafka cluster in the same network and mirror to remote cluster
  8.  ?



These are all the options I could gather so far. Some of the options probably won't work for my situation -- for example, Oracle GoldenGate might be too expensive -- but I don't want to rule anything out just yet.





How would you approach this, and why? Which options might work? Which options would you advise against?




I appreciate any advice. Thank you in advance.


Thanks,

Philip

Re: Reliably producing records to remote cluster: what are my options?

Posted by Philip Schmitt <ph...@outlook.com>.
I tried to write down the pros and cons of each approach to the best of my current knowledge.
Feel free to correct me or comment on the various approaches.

So far I still have no clear winner for my use case.



Basic challenges:

a) network outage

A network outage between the application servers and the remote Kafka cluster prevents the system from writing records into Kafka.
If these events are not persisted, a shutdown of an application server could cause significant loss of data.

Strategies against network outage:
- Persist data on a local machine/network and write to Kafka in a fault-tolerant way.

b) non-transactional writing to both database and Kafka

If we dual-write the events to both the database and Kafka in a non-atomic fashion, the data might diverge.

Strategies against diverging data:
- Passing events through the database, so that the application writes only to the database and changes to the database are then written to Kafka in a fault-tolerant way (see the sketch after this list).
- Passing events through Kafka so that the application writes only to Kafka and events are only written to the database when consumed from Kafka.
- Publish state repeatedly (instead of specific events) so that a mismatch between the database and Kafka consumers resolves itself.
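
For the first strategy, what I have in mind is something like an "outbox" table: the application writes the business change and the event row in the same local transaction, and a connector or CDC tool later ships the outbox table to Kafka. A rough, untested sketch (dataSource, customerId and eventJson stand in for the application's own objects; table and column names are invented):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    // Sketch: business change + event row in ONE local transaction ("outbox" style),
    // so the database and the eventual Kafka record cannot diverge.
    void unsubscribeAndRecordEvent(DataSource dataSource, long customerId, String eventJson)
            throws SQLException {
        try (Connection con = dataSource.getConnection()) {
            con.setAutoCommit(false);
            try (PreparedStatement upd = con.prepareStatement(
                     "UPDATE customer SET newsletter_opt_in = 0 WHERE id = ?");
                 PreparedStatement ins = con.prepareStatement(
                     "INSERT INTO event_outbox (event_type, payload, created_at) "
                         + "VALUES (?, ?, CURRENT_TIMESTAMP)")) {
                upd.setLong(1, customerId);
                upd.executeUpdate();
                ins.setString(1, "CustomerUnsubscribedFromNewsletter");
                ins.setString(2, eventJson);
                ins.executeUpdate();
                con.commit();   // Kafka Connect or CDC later ships event_outbox to Kafka
            } catch (SQLException e) {
                con.rollback(); // keep the database consistent if either write fails
                throw e;
            }
        }
    }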



Options:

a) Produce records directly within the application when modifying the data in the database

Pro:
- Simple implementation.
-- Does not slow down handling of requests, since requests to the Kafka cluster happen asynchronously in batches (unless otherwise configured).
-- Can handle retries and log failed records (see the sketch below this option).
Con:
- Messages that are not persisted can be lost.
- Data can diverge between database and Kafka.
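
What I mean by handling retries and logging failed records is roughly this (untested sketch; the topic name and the surrounding variables producer, customerId, eventJson and log are placeholders):

    import org.apache.kafka.clients.producer.ProducerRecord;

    // Sketch: send asynchronously; the callback is invoked once the record is acknowledged
    // or once the producer has given up (retries exhausted or a non-retriable error).
    ProducerRecord<String, String> record =
            new ProducerRecord<>("shop-events", String.valueOf(customerId), eventJson);
    producer.send(record, (metadata, exception) -> {
        if (exception != null) {
            // e.g. append the record to a local error file for later replay (see option h)
            log.error("Failed to produce event for customer " + customerId, exception);
        }
    });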


b) Produce records from the database via Kafka Connect JDBC connector

Pro:
- Mostly configuration/operation with little implementation (see the example config below).
- Consistent data in database and Kafka.
Con:
- Polling queries put additional load on the source database.
- May not be possible to deduce specific events from changes to database row.
- May not reduce coupling between consumers and database schema.
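
For reference, a Kafka Connect JDBC source connector configuration might look roughly like this (untested sketch in standalone .properties form; connection URL, credentials, table and topic names are placeholders, and incrementing mode assumes the table has a suitable ID column):

    # Sketch of a JDBC source connector configuration; all values are placeholders.
    name=oracle-events-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    tasks.max=1
    connection.url=jdbc:oracle:thin:@//dbhost:1521/SHOP
    connection.user=connect_user
    connection.password=secret
    # read only the event table, tracking new rows via an auto-incrementing ID column
    table.whitelist=EVENT_OUTBOX
    mode=incrementing
    incrementing.column.name=ID
    topic.prefix=shop-db-
    poll.interval.ms=5000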


c) Produce records from the database via a Change Data Capture solution (GoldenGate, Attunity, Striim, …)

Pro:
- Does not put any load on the source database.
- Mostly configuration/operation with little implementation.
- Consistent data in database and Kafka.
Con:
- May not be possible to deduce specific events from changes to database row.
- May not reduce coupling between consumers and database schema.
- CDC solutions can be prohibitively expensive.


d) Persist events in log files and produce to Kafka via elastic Logstash/Filebeat

Pro:
- Built to handle log files.
- Distributed communication decoupled from application.
- Not affected by network outages / server shutdowns.
Con:
- Elastic's Kafka output may not be geared towards fault tolerance as much as Kafka Connect.
- Data can diverge between database and Kafka.


e) Persist events in log files and produce to Kafka via a Kafka Connect file source connector

Pro:
- Distributed communication decoupled from application.
- Not affected by network outages / server shutdowns.
Con:
- Existing file source connectors apparently cannot handle log rotation or files that are continuously appended to.
- Writing a Kafka Connect connector may take considerable effort.
- Data can diverge between database and Kafka.


f) Persist events in a local, embedded database and produce to Kafka via an existing source connector

Pro:
- Can use existing Kafka Connect connector.
- Distributed communication decoupled from application.
- Not affected by network outages / server shutdowns.
Con:
- Data can diverge between database and Kafka (or can you do an atomic write across databases?).


g) Produce records directly within the application to a new Kafka cluster in the same network (or on the same machines) and mirror records to the remote cluster

Pro:
- Low risk of network issues compared to the remote cluster (but Kafka cluster availability is still not guaranteed).
- Simple implementation.
-- Does not slow down handling of requests, since requests to the Kafka cluster happen asynchronously in batches (unless otherwise configured)
-- Can handle retries and log failed records.
Con:
- Messages that are not persisted can be lost.
- Data can diverge between database and Kafka.

See https://mail-archives.apache.org/mod_mbox/kafka-users/201709.mbox/browser


h) Combine techniques?

For example, produce records directly, but log errors and publish them later (via Kafka Connect or Logstash/Filebeat).

Pro:
- Reduce impact of cluster unavailability.
- Could use an existing Kafka Connect connector to process infrequent error files (see the example config below).
Con:
- The two-pronged approach takes more effort.
- Records from direct producer and Kafka Connect must match.
- Order of events might change.

See https://groups.google.com/forum/#!topic/confluent-platform/jbo4pNVoc84
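
For replaying the error files, the file source connector that ships with Kafka might be enough, since the files would be small and written infrequently. Roughly (untested sketch; file path and topic name are placeholders):

    # Sketch: replay a local error file with the FileStreamSource connector bundled with Kafka.
    name=failed-events-replay
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/var/log/shop/failed-events.log
    topic=shop-events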

i) ?


Regards,
Philip

P.S.:

@Hagen: Thank you. I'll keep that in mind if we go down that road :)


Re: Reliably producing records to remote cluster: what are my options?

Posted by Hagen Rother <ha...@liquidm.com>.
In my experience, 7 is the easiest route. Just make sure to run the
mirror-maker on the consumer side of the WAN; it's an order of magnitude
faster this way.

If you put
receive.buffer.bytes=33554432
send.buffer.bytes=33554432
in your mirror-maker consumer config and adjust the remote broker config
(server.properties) to
socket.receive.buffer.bytes=33554432
socket.send.buffer.bytes=33554432

you can reliably mirror large volumes across the Atlantic (we do). It would
be so much nicer to run the mirror-maker on the producer side of the WAN
(enable compression in the mirror-maker and have compressed data on the WAN,
with the CPU for that outside the hot path), but like I said, that's an order
of magnitude slower for unknown (but reproducible) reasons.
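
For illustration, the mirror-maker invocation is roughly the following (the config file names and the topic whitelist are placeholders):

    bin/kafka-mirror-maker.sh \
      --consumer.config source-cluster-consumer.properties \
      --producer.config remote-cluster-producer.properties \
      --whitelist "shop.*"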

Cheers,
Hagen




-- 
*Hagen Rother*
Lead Architect | LiquidM