Posted to users@kafka.apache.org by Mark Drago <ma...@gmail.com> on 2015/12/16 14:27:34 UTC

kafka-connect-jdbc: ids, timestamps, and transactions

I had asked this in a GitHub issue but I'm reposting here to try to get an
answer from a wider audience.

Has any thought gone into how kafka-connect-jdbc will be impacted by SQL
transactions committing IDs and timestamps out of order?  Let me give an
example with two connections.

1: begin transaction
1: insert (get id 1)
2: begin transaction
2: insert (get id 2)
2: commit (recording id 2)
kafka-connect-jdbc runs and thinks it has handled everything through id 2
1: commit (recording id 1)

This would result in kafka-connect-jdbc missing id 1. The same thing could
happen with timestamps. I've read through some of the kafka-connect-jdbc
code and I think it may be susceptible to this problem, but I haven't run
it or verified that it would be an issue. Has this come up before? Are
there plans to deal with this situation?
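
To make the race concrete, here's a rough sketch in plain SQL (the table
and column names are invented, and the polling query is only an
approximation of what an incrementing-id mode would issue, not the
connector's exact SQL):

    -- connection 1:
    BEGIN;
    INSERT INTO events (payload) VALUES ('a');  -- auto-increment assigns id 1

    -- connection 2:
    BEGIN;
    INSERT INTO events (payload) VALUES ('b');  -- auto-increment assigns id 2
    COMMIT;                                     -- id 2 is now visible

    -- the connector polls, roughly:
    SELECT * FROM events WHERE id > 0 ORDER BY id ASC;
    -- only id 2 is visible, so the stored offset advances to 2

    -- connection 1:
    COMMIT;  -- id 1 becomes visible, but 1 < 2, so later polls never return it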

Obviously something like Bottled Water for PostgreSQL would handle this
nicely, as it would see the changes only once they're committed.


Thanks for any insight,

Mark.


Original GitHub issue:
https://github.com/confluentinc/kafka-connect-jdbc/issues/27

Re: kafka-connect-jdbc: ids, timestamps, and transactions

Posted by James Cheng <jc...@tivo.com>.
Mark, what database are you using?

If you are using MySQL...

<shameless plug>

There is a not-yet-finished Kafka MySQL Connector at https://github.com/wushujames/kafka-mysql-connector. It tails the MySQL binlog, and so will handle the situation you describe.
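
The intuition, schematically: a log reader consumes changes in commit
order rather than id order, so both rows from your example arrive and
neither can be skipped.

    -- what a commit-log/binlog reader sees (schematic, not real binlog output):
    -- 1. commit of connection 2's transaction -> row with id 2
    -- 2. commit of connection 1's transaction -> row with id 1, still delivered

There's no high-water mark on the id column to advance past id 1.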

But, as I mentioned, I haven't finished it yet.

If you are using MySQL and don't specifically need/want Kafka Connect, then there are a bunch of other options. There is a list of them at https://github.com/wushujames/mysql-cdc-projects. But I'd recommend using the Kafka Connect framework, since it was built for this exact purpose.

</shameless plug>

-James


Re: kafka-connect-jdbc: ids, timestamps, and transactions

Posted by Mark Drago <ma...@gmail.com>.
Ewen,

Thanks for the reply.  We'll proceed while keeping all of your points in
mind.  I looked around for a more focused forum for the JDBC connector
before posting here but didn't come across the confluent-platform group.
I'll direct any more questions about the JDBC connector there.  I'll also
close the GitHub issue with a link to this thread.

Thanks again,
Mark.


Re: kafka-connect-jdbc: ids, timestamps, and transactions

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
Mark,

There are definitely limitations to using JDBC for change data capture.
A database-specific implementation, especially one that can read directly
off the database's log, will be able to handle more situations like this.
Cases like the one you describe are difficult to address efficiently when
working only with simple queries.

The JDBC connector offers a few different modes for handling incremental
queries. One of them uses both a timestamp and a unique ID, which will be
more robust to issues like these. However, even with both, you can still
come up with variants that cause issues like the one you describe. You
also have the option of using a custom query, which might help if you can
do something smarter by making assumptions about your table; for now,
though, custom queries are pretty limited for incremental use, since the
connector doesn't provide a way to track offset columns with them. I'd
like to improve the support for this in the future, but at some point it
starts making sense to look at database-specific connectors.
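
For reference, the combined mode looks roughly like this in a standalone
source connector config (a sketch: the connection URL, table, and column
names here are invented):

    name=jdbc-source-example
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:postgresql://localhost:5432/mydb?user=kafka&password=secret
    table.whitelist=events
    # use both a timestamp column and a strictly incrementing id column
    mode=timestamp+incrementing
    timestamp.column.name=modified_at
    incrementing.column.name=id
    topic.prefix=jdbc-

The id mostly serves to break ties between rows that share a timestamp; it
doesn't, by itself, protect against a transaction that commits an older id
or timestamp after the stored offset has already moved past it.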

(By the way, this gets even messier once you start thinking about the
variety of different isolation levels people may be using...)

-Ewen

P.S. Where to ask these questions is a bit confusing since Connect is part
of Kafka. In general, for specific connectors I'd suggest asking on the
corresponding mailing list for the project, which in the case of the JDBC
connector would be the Confluent Platform mailing list here:
https://groups.google.com/forum/#!forum/confluent-platform



-- 
Thanks,
Ewen